AD-AO30  330 


unclassified 


ILLINOIS UNIV AT URBANA-CHAMPAIGN COORDINATED SCIENCE LAB F/G 9/2
IMPROVING THE THROUGHPUT OF PIPELINES WITH DELAYS AND BUFFERS. (U)
OCT 76 J H PATEL DAAB07-72-C-0259


AD  A 038  33  8 




COORDINATED SCIENCE LABORATORY


IMPROVING  THE  THROUGHPUT 
OF  PIPELINES  WITH 
DELAYS  AND  BUFFERS 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS BEFORE COMPLETING FORM

1. REPORT NUMBER
2. GOVT ACCESSION NO.
3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
IMPROVING THE THROUGHPUT OF PIPELINES WITH
DELAYS AND BUFFERS

5. TYPE OF REPORT & PERIOD COVERED
Technical Report

6. PERFORMING ORG. REPORT NUMBER
R-747; UILU-ENG-76-2235

7. AUTHOR(s)
Janak H. Patel

8. CONTRACT OR GRANT NUMBER(s)
MCS 73-03488 A01
DAAB-07-72-C-0259

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS

11. CONTROLLING OFFICE NAME AND ADDRESS
Joint Services Electronics Program

13. NUMBER OF PAGES

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
UNCLASSIFIED

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Pipelined Computers
Buffered Systems
Parallel Processing

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

A pipeline  as  defined  here  is  a collection  of  segments  of  hardware  which 
can  operate  simultaneously.  A task  flows  synchronously  from  segment  to  segment 
for  its  execution.  Each  task  follows  one  of  several  distinct  task  flow  patterns 
which  are  assumed  to  be  fixed  and  known  in  advance. 

It  is  characteristic  of  pipelines  that  a task  can  be  initiated  in  the 
pipeline before an earlier initiated task has completed its execution. A problem
arises  when  two  or  more  tasks  try  to  use  the  same  segment  at  the  same  time, 
causing a collision. Such collisions can be avoided by appropriate scheduling
of tasks and/or modification of the pipeline; or the collisions can be resolved
at  the  time  and  place  they  occur.  These  alternatives  are  studied  in  this  work 
with  the  objective  that  the  throughput,  i.e.,  average  number  of  tasks  processed 
per  unit  time,  be  maximized  while  providing  considerable  flexibility  in 
scheduling  of  tasks. 

Collision  characteristics  of  task  schedules  and  pipelines  are  studied. 

A methodology  for  inserting  fixed  delays  in  a pipeline  to  achieve  desired 
collision  characteristics  is  presented. 

Collisions can be resolved by letting one task use the segment and storing
the rest of the competing tasks in a buffer. Several priority schemes for the
buffers have been investigated, analytically for periodic task arrivals and
experimentally for random arrivals. The characteristics studied are queue
size, wait time and throughput.

It  is  shown  that  the  theoretical  maximum  throughput  of  a pipeline  is 
attainable  with  the  use  of  fixed  delays  or  buffers.  Moreover  it  is  shown 
that  a substantial  degree  of  freedom  in  scheduling  of  tasks  is  achieved  by 
these  methods. 




This  work  was  supported  in  part  by  the  National  Science 
Foundation  under  Grant  MCS  73-03488  A01  and  in  part  by  the  Joint 
Services Electronics Program (U.S. Army, U.S. Navy and U.S. Air
Force) under Contract DAAB-07-72-C-0259.


Reproduction  in  whole  or  in  part  is  permitted  for  any 
purpose  of  the  United  States  Government. 


Approved  for  public  release.  Distribution  unlimited. 



ACKNOWLEDGMENTS 

I wish to express my gratitude to my advisor Dr. Edward S.
Davidson for his guidance, encouragement and genuine concern throughout
the course of this research. These years of association with Dr.
Davidson will always be remembered and valued.

I also wish to express my appreciation to Mrs. Rose Harris
for her immaculate typing of this report.

Finally, I wish to thank all my colleagues at Stanford
Electronics Laboratories and Coordinated Science Laboratory for
providing an intellectually stimulating atmosphere.


IMPROVING  THE  THROUGHPUT  OF  PIPELINES  WITH  DELAYS  AND  BUFFERS 


Janak  H.  Patel,  Ph.D. 
Coordinated  Science  Laboratory 


A pipeline as defined here is a collection of segments of hardware
which can operate simultaneously. A task flows synchronously from
segment to segment for its execution. Each task follows one of several
distinct task flow patterns which are assumed to be fixed and known in
advance.

It is characteristic of pipelines that a task can be initiated in
the pipeline before an earlier initiated task has completed its execution.
A problem arises when two or more tasks try to use the same segment at the
same time, causing a collision. Such collisions can be avoided by appropriate
scheduling of tasks and/or modification of the pipeline; or the
collisions can be resolved at the time and place they occur. These alternatives
are studied in this work with the objective that the throughput,
i.e., average number of tasks processed per unit time, be maximized while
providing considerable flexibility in scheduling of tasks.

Collision characteristics of task schedules and pipelines are
studied. A methodology for inserting fixed delays in a pipeline to achieve
desired collision characteristics is presented.

Collisions can be resolved by letting one task use the segment
and storing the rest of the competing tasks in a buffer. Several priority
schemes for the buffers have been investigated, analytically for periodic
task arrivals and experimentally for random arrivals. The characteristics
studied are queue size, wait time and throughput.

It is shown that the theoretical maximum throughput of a pipeline
is attainable with the use of fixed delays or buffers. Moreover, it is
shown that a substantial degree of freedom in scheduling of tasks is
achieved by these methods.


TABLE OF CONTENTS

1. INTRODUCTION
1.1 Background
1.2 Pipeline Terminology
1.3 Review of Related Work
1.4 Objectives of this Research
1.5 An Overview of this Work

2. PIPELINES WITH DELAYS
2.1 Introduction
2.2 The Allowability of Pipelines and Cycles
2.3 Constant Latency Cycles
2.4 Sharing of Noncompute Segments
2.5 Defining the Optimum Solution
2.6 Formulation of the Optimization Problem
2.7 Obtaining the Optimum Solution
2.8 Looping Pipelines
2.9 Arbitrary Cycles
2.10 Multifunction Pipelines
2.11 Concluding Remarks

3. INTERNAL BUFFERS WITH PERIODIC ARRIVALS
3.1 Introduction
3.2 Priority Schemes
3.3 Bounds on Queue Size and Wait Time in LIFO-Global Priority
3.4 Priority Schemes Based on Processing State
3.5 Bounds on Queue Size and Wait Time with Precedence-Implied Priority
3.6 Concluding Remarks

4. INTERNAL BUFFERS WITH RANDOM ARRIVALS
4.1 Introduction
4.2 Experimental Investigation
4.3 Observations on the Experimental Results
4.4 Concluding Remarks

5. DESIGNING OF PIPELINES
5.1 Introduction
5.2 A Design Methodology
5.3 Summary

6. CONCLUDING REMARKS
6.1 Conclusions and Summary of Results
6.2 Suggestions for Further Research

LIST OF REFERENCES

LIST OF TABLES

4.2.1 Pipelines used for simulation
4.3.1 Results of simulation

LIST OF FIGURES

1.1.1 Example of a single function pipeline
1.1.2 A pipeline with parallel computation and feedback
1.1.3 Reservation table of a multifunction pipeline
1.2.1 Reservation table of a single function pipeline
2.1.1 Delaying parallel computation steps in a pipeline
2.3.1 Making a pipeline allowable with respect to cycle (4)
2.4.1 Assignment of elemental delays to noncompute segments
2.5.1 An optimum solution to the problem of making the pipeline of Fig. 2.3.1a allowable with respect to cycle (4)
2.6.1 Reservation table for Example 2.6.1
2.6.2 Reservation table for Example 2.6.2
2.7.1 A branch-and-bound search for optimum solutions
2.8.1 Making a looping pipeline allowable with respect to a constant latency cycle
2.9.1 Generating maximal compatibility classes of cycle (1,9) using Algorithm 2.9.1
2.9.2 Generating maximal compatibility classes of cycle (1,9) using Algorithm 2.9.2
2.9.3 Making a pipeline allowable with respect to cycle (2,3,4)
2.9.4 Making a pipeline allowable with respect to cycle (2,3,4)
2.10.1 Usage intervals of a multifunction pipeline
2.10.2 Generating the maximal compatibility classes of cycle (1_A, 1_B, 2_A)
3.1.1 Reservation table for Example 3.1.1
3.2.1 A pipeline with internal buffers
3.2.2 Flow of tasks in a buffered pipeline with FIFO-global priority
3.2.3 A hypothetical queue behavior in a FIFO-global pipeline
3.3.1 Flow of tasks in a buffered pipeline with LIFO-global priority
3.3.2 Example showing tightness of bounds on queue size and expected wait time with LIFO-global priority
3.3.3 Example showing the tightness of bound on maximum wait time with LIFO-global priority
3.4.1 A single function pipeline with one buffer per computation step
3.4.2 A multifunction pipeline with one buffer per computation step
3.5.1 Precedence-priority graph of a multifunction pipeline
3.5.2 Precedence-priority graphs with LPF and MWRF priorities
3.5.3 Examples of ppn assignment in precedence priority graphs
4.3.1 Plot of expected wait time w^ vs. workload
4.3.2 Plot of expected wait time w^ vs. buffer size
4.3.3 Plot of expected wait time w 2 vs. workload
4.3.4 Plot of expected wait time w^ vs. buffer size
5.2.1 A computation graph of function Z
5.2.2 A computation graph of function Z
5.2.3 Reservation tables of Figures 5.2.2
5.2.4 Pipeline with lower bound 8 made allowable with respect to cycle (8)
5.2.5 Three distinct assignments of four multiply operations to segments Sq and
5.2.6 Pipelines of Figure 5.2.5 made allowable with respect to cycle (4)
5.2.7 Pipelines of Figure 5.2.5 made allowable with respect to cycle (2,6)
5.2.8 Pipeline allowed by cycle (2)
5.2.9 Multiply segments divided to obtain a lower bound of 1



Chapter  1 
INTRODUCTION 

1.1.  Background 

Present-day computers, however powerful they may be, are still
inadequate for some problems. Typical of such problems is global
weather modeling. Indeed, there will always be problems that
require more and more powerful machines. There are three alternatives
(not necessarily mutually exclusive) for obtaining higher computational
capability: higher speed electronic components, parallel
processing and pipeline processing. Pipeline, or overlap,
processing is the subject matter of this research.

The concept of pipelining is an old one. The bucket
brigade used to fight fires of yesteryear was a pipeline. The assembly
line used in manufacturing plants is another example of a pipeline.
Consider an automobile assembly line with n workstations. A partially
assembled automobile moves from one workstation to another. Each
workstation does a specific job on an auto. Assume that each workstation
operates on an auto for t minutes. Then a finished car leaves
the line every t minutes even though the total time to assemble a car
is n times t minutes. The same concept is utilized
in computational pipelines.

A pipeline can be defined as a collection of resources called
segments which can be kept busy simultaneously. A task once initiated
flows from segment to segment for its execution, each segment
performing a specific suboperation. The flow of each task is predetermined
and fixed. A task once initiated must flow synchronously
without preemption or wait, unless internal buffers are provided
between segments.

A pipeline in which all tasks have an identical flow pattern is
termed a single function pipeline. In a multifunction pipeline
there are two or more distinct flow patterns and each task uses one of
these flow patterns.

A schematic  example  of  a single  function  pipeline  appears  in 
Figure  1.1.1a.  Each  segment  has  an  associated  output  register  or 
latch.  The  output  of  the  segment  is  transferred  to  this  register  when 
an  appropriate  control  signal  is  received.  The  time  shown  in  each  box 
is  the  time  to  perform  that  particular  subtask  and  includes  the  time  to 
latch  the  output.  For  this  illustration,  let  us  ignore  any  memory 
conflicts  between  instruction  fetch  and  operand  fetch.  The  total  time 
to  process  an  instruction  in  this  example  is  1800  ns.  If  no  overlap 
between  instructions  is  provided  then  the  maximum  processing  rate  is 
one  instruction  per  1800  ns.  However,  with  overlap  permitted  it  is 
easy  to  see  that  one  instruction  every  600  ns  can  be  processed. 
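
As a quick check of these figures, the following sketch recomputes both processing rates. It is only an illustration: the four segment times are inferred from the unit times quoted in the text for Figure 1.1.1a (3, 1, 3 and 2 units of 200 ns), and the variable names are hypothetical.

```python
# Segment times (ns) for the instruction pipeline of Figure 1.1.1a.
# Assumption: values inferred from the text (3, 1, 3 and 2 units of 200 ns).
segment_ns = {
    "instruction fetch": 600,
    "instruction decode": 200,
    "operand fetch": 600,
    "instruction execution": 400,
}

# Without overlap, an instruction starts only after the previous one finishes.
ns_per_instruction_no_overlap = sum(segment_ns.values())

# With overlap, the slowest segment limits how often an instruction can start.
ns_per_instruction_overlap = max(segment_ns.values())

speedup = ns_per_instruction_no_overlap / ns_per_instruction_overlap
print(ns_per_instruction_no_overlap, ns_per_instruction_overlap, speedup)
```

Under these assumed segment times, the overlapped rate is limited by the two 600 ns fetch segments, which is why the rate is one instruction per 600 ns rather than per 1800 ns.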

Thus the effectiveness of pipelining lies in the fact that
a task can be initiated before the completion of some previously
initiated tasks, resulting in high performance. Moreover, the segments
can be special rather than general purpose, resulting in low cost.
Owing to their higher performance per unit cost, pipelines are
becoming increasingly common in recent and proposed computer
systems. Familiar among the pipeline computers are the IBM 360/91
[AND67a, AND67b], Texas Instruments ASC [WAT72, STE73], CDC 7600 [BON69]
and CDC STAR-100 [HIN72]. Some others are described in [IBB72, SHA74,
PAR74].

It  is  convenient  to  define  a basic  time  unit  for  segment 
execution  times.  For  example,  if  the  time  unit  is  taken  as  200  ns  in 
Figure  1.1.1a  then  segment  times  for  instruction  fetch  and  operand 
fetch  are  each  3 units;  for  instruction  decode,  1 unit;  and  for 
instruction  execution,  2 units.  In  general,  the  time  unit  is  taken  to 
be  equal  to  the  greatest  common  divisor  of  the  execution  times  of  all 
segments. The system clock is defined in terms of this basic time
unit, where one clock cycle equals one time unit. This clock
synchronizes the data transfers from segment to segment. This
synchronization is required in our model.
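
The choice of time unit can be sketched in a few lines. The segment times are the ones the text gives for Figure 1.1.1a; the helper names are illustrative assumptions.

```python
from functools import reduce
from math import gcd

# Segment execution times in ns: instruction fetch, decode, operand fetch, execute.
segment_ns = [600, 200, 600, 400]

# The basic time unit is the greatest common divisor of all segment times.
time_unit_ns = reduce(gcd, segment_ns)

# Each segment time expressed in whole clock cycles (time units).
segment_units = [t // time_unit_ns for t in segment_ns]

print(time_unit_ns, segment_units)
```

This reproduces the 200 ns unit and the segment times of 3, 1, 3 and 2 units quoted above.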

The flow in a pipeline can be described more conveniently
by a reservation table such as Figure 1.1.1b. Rows of a reservation
table correspond to segments and columns correspond to units of time.
The function type, denoted by a single capital letter, is placed in
row i and column j (cell (i,j)) of the reservation table if after j
units of execution a task of that function type requires segment S_i.
We shall use X as the function type in single function pipelines.
Figure 1.1.1b is the reservation table of the pipeline of Figure 1.1.1a.
The schematic of a slightly more complex pipeline is given in Figure
1.1.2a; its reservation table appears in Figure 1.1.2b. Figure 1.1.3
is an example of a multifunction pipeline with two function types,
A and B. As yet we have not put any restriction on the flow patterns.
We may impose some minor restrictions on permissible flow patterns
later on. However, no restrictions are assumed on the reusage or
sharing of a segment.
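
A reservation table is easy to hold as a small data structure. The sketch below builds the table of a straight through pipeline from its segment times in time units. It assumes that the segments are used once each in order and are numbered from 0; the function names and the dictionary layout are hypothetical choices made for illustration.

```python
def straight_through_table(durations):
    """Map each segment index to the list of columns (time units) it occupies,
    assuming a task visits the segments once each, in order."""
    table = {}
    start = 0
    for i, d in enumerate(durations):
        table[i] = list(range(start, start + d))
        start += d
    return table


def render(table):
    """Draw the table with one row per segment and an X in every reserved cell."""
    width = max(max(cols) for cols in table.values()) + 1
    lines = []
    for i in sorted(table):
        row = "".join("X" if j in table[i] else "." for j in range(width))
        lines.append(f"S{i} {row}")
    return "\n".join(lines)


# Segment times of 3, 1, 3 and 2 units, as quoted for Figure 1.1.1a.
table = straight_through_table([3, 1, 3, 2])
print(render(table))
```

Each row of the printed table corresponds to a segment and each column to one time unit, matching the convention described above.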

Reusage or sharing of a segment is identified in a reservation
table by multiple entries of X's or other function types in a
single row. It is the reusage or sharing of segments which poses a
problem: two or more tasks may attempt to use a segment at the
same time, causing a collision. For the proper operation of the pipeline,
either the task initiations must be properly scheduled so that no
collisions occur, or internal buffers between segments must be provided
so that all but one of the tasks are preempted. Previous researchers
have concentrated their efforts on finding a schedule for initiations
which causes no collisions and gives a high throughput. Before we
outline their work some terminology is necessary.


1.2.  Pipeline  Terminology 

A computation  step  is  an  indivisible  operation  by  a segment 
on  a task.  Thus  in  Figure  1.1.2  the  first  computation  step  is  2 time 
units  long;  the  next  step  is  one  unit  long.  Two  parallel  computation 
steps  on  segments  S£  and  are  each  one  unit  long  and  two  successive 
steps  on  are  each  one  time  unit.  A knowledge  of  computation  steps, 
which  distinguishes  between  Sq  and  usage,  is  relevant  to  considering 


possible  pipeline  redesign. 


A usage interval of a segment is defined to be the time
interval between two reservations (X's) of that segment. For example,
in Figure 1.2.1 all usage intervals of one of the segments are 2, 3 and 5.
Let F denote the set of all usage intervals of a pipeline; e.g.,
F = {1,2,3,5} for Figure 1.2.1.
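
The set F can be computed directly from a reservation table by taking all pairwise column distances within each row. In the sketch below the table is hypothetical: Figure 1.2.1 is not reproduced here, so the rows were simply chosen so that one segment has usage intervals 2, 3 and 5 and the whole table yields F = {1,2,3,5}.

```python
from itertools import combinations


def usage_intervals(table):
    """Return the set of distances between every pair of reservations in a row."""
    intervals = set()
    for cols in table.values():
        for a, b in combinations(sorted(cols), 2):
            intervals.add(b - a)
    return intervals


# Hypothetical single function table: segment index -> reserved columns.
hypothetical_table = {
    0: [0, 2, 5],  # usage intervals 2, 3 and 5
    1: [1, 2],     # usage interval 1
    2: [3],        # used once, so no usage interval
}

F = usage_intervals(hypothetical_table)
print(sorted(F))
```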

An  initiation  interval  of  a pair  of  task  initiations  is 
simply  the  interval  in  time  units  between  the  two  time  instants  of 
those  initiations.  A latency  is  the  initiation  interval  of  two 
successive  initiations. 

A sequence  of  task  initiations  can  be  completely  described  by 
a sequence  of  latencies.  For  example,  task  initiations  at  time 
instants  0,  3,  5,  9 and  12  can  be  described  by  the  latency  sequence 
(3,2,4,3). Let G denote the set of all initiation intervals of an
initiation sequence. Thus G for latency sequence (3,2,4,3) is
{2,3,4,5,6,7,9,12}.
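
The set G for this example can be reproduced mechanically: recover the initiation instants from the latencies, then take every pairwise difference. The function name is an illustrative assumption.

```python
from itertools import accumulate, combinations


def initiation_intervals(latencies):
    """Return the set of intervals between every pair of task initiations."""
    # Initiation instants, with the first task initiated at time 0.
    times = [0] + list(accumulate(latencies))
    return {later - earlier for earlier, later in combinations(times, 2)}


G = initiation_intervals([3, 2, 4, 3])
print(sorted(G))
```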

Clearly, a collision occurs between two tasks if their
initiation interval equals one of the usage intervals of the pipeline
[DAV71].
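
This collision condition is a simple membership test. As a sketch, the code below lists the pairs of initiations that would collide when the initiation instants 0, 3, 5, 9 and 12 are applied to a pipeline with F = {1,2,3,5}. Pairing this particular sequence with this particular F is our own choice for illustration; the text introduces them separately.

```python
from itertools import combinations


def colliding_pairs(times, F):
    """Return the pairs of initiation instants whose interval is a usage interval."""
    return [
        (earlier, later)
        for earlier, later in combinations(sorted(times), 2)
        if later - earlier in F
    ]


F = {1, 2, 3, 5}
pairs = colliding_pairs([0, 3, 5, 9, 12], F)
print(pairs)
```

An initiation sequence is collision-free exactly when this list is empty.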

For multifunction pipelines, the usage intervals and initiation
intervals are defined similarly. Let F_XY be the set of usage
intervals of all (X,Y) pairs in the reservation table. This set can
be formed by taking all pairwise distances between an X and a Y which
appears to the right of the X in the same row. Here X and Y are
variables that take function types as their values; these values need
not be distinct. Thus for Figure 1.1.3, the following sets of usage
intervals exist:

F_AA = {1}, F_AB = {0,1,2,4}, F_BA = {0,1,2}, F_BB = {2}.


A multifunction initiation sequence can be described by a
latency sequence, where each latency is subscripted by the type of the
task being initiated. Thus, for example, initiations of tasks of type A
at time instants 0,3,5; of tasks of type B at instants 4,5; and of type C
at time instants 7,9 can be described by the latency sequence

(0_A, 3_A, 1_B, 1_A, 0_B, 2_C, 2_C).

The leading 0_A is just a convenient way of saying that the first
initiation is of type A. Two tasks of type A and B are being initiated
simultaneously at time instant 5 and therefore the latency between these
tasks is 0; in the latency sequence they can be written in either order.
A latency of 0 between two tasks of the same type is not permissible.
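
The subscripted latency sequence for this example can be derived from the initiation instants alone. The sketch below is one way to do it; the function name and the tie-breaking rule (simultaneous initiations are ordered alphabetically by type) are assumptions, and the text notes that either order is acceptable.

```python
def subscripted_latencies(initiations):
    """initiations maps a function type to its list of initiation instants.
    Returns (latency, type) pairs in order of initiation."""
    # Sort all initiations by time; ties are broken by function type.
    events = sorted((t, f) for f, times in initiations.items() for t in times)
    sequence = []
    previous = 0  # the first latency is measured from time 0
    for t, f in events:
        sequence.append((t - previous, f))
        previous = t
    return sequence


sequence = subscripted_latencies({"A": [0, 3, 5], "B": [4, 5], "C": [7, 9]})
text = ", ".join(f"{latency}_{f}" for latency, f in sequence)
print(text)
```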

In a multifunction pipeline a collision occurs if an interval
between a task of type X and an earlier initiated task of type Y is
equal to one of the usage intervals of the pair (X,Y) in any row of
the reservation table. We shall present the formulation of initiation
intervals of a multifunction sequence later in Section 2.10.

If  a subsequence  of  latencies  repeats  itself  in  an  infinite 
sequence,  then  the  subsequence  is  termed  an  initiation  cycle  or  simply 
a cycle.  Thus  a cycle  (1^,3B>2C>3^)  implies  an  initiation  sequence 
(•••  l^,3g,2^,3^, •***).  In  a single  function 
pipeline, a cycle, for example (2,3,2,5), implies the latency sequence
(2,3,2,5,2,3,2,5,2,3,2,5,2,3,...). A cycle with only one latency
is called a constant latency cycle, e.g., cycle (4).

An  initiation  cycle  is  said  to  be  allowable  with  respect  to 
a pipeline,  if  no  collisions  occur  when  the  cycle  is  followed  for  task 
initiations;  conversely  we  also  say  that  the  pipeline  is  allowable  with 
respect  to  the  cycle.  The  definition  of  allowability  also  applies  at 
finer  levels.  For  example,  we  say  that  an  initiation  interval  between 
two  tasks  is  allowed  by  a pipeline  if  no  collision  occurs;  and  a usage 
interval  is  allowed  by  a cycle  if  no  collision  occurs  due  to  that 
particular  usage  interval. 
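
For a single function pipeline, allowability of a cycle can be tested against the usage interval set F without unrolling the infinite sequence: only initiation intervals up to the largest usage interval matter. The following is a minimal sketch under that observation; the function name is hypothetical, and F = {1,2,3,5} is the set quoted earlier for Figure 1.2.1.

```python
def is_allowable(cycle, F):
    """Return True if following the cycle never produces an initiation interval
    that equals a usage interval in F. Assumes every latency is at least 1."""
    if not F:
        return True
    longest = max(F)
    k = len(cycle)
    for start in range(k):
        interval = 0
        i = start
        # Sum consecutive latencies (cyclically) until no usage interval can match.
        while True:
            interval += cycle[i % k]
            if interval > longest:
                break
            if interval in F:
                return False
            i += 1
    return True


F = {1, 2, 3, 5}
print(is_allowable([4], F), is_allowable([2, 3, 2, 5], F))
```

With this F, the constant latency cycle (4) is allowable, while the cycle (2,3,2,5) is not, since its first latency already equals a usage interval.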

The period p of a cycle is defined as the sum of the
latencies of the cycle. The average latency l_a of a cycle is the
average of the latencies in the cycle. For example, the period p of
cycle (2,3,2,5) is 12 and the average latency l_a is 12/4 = 3. Thus cycle
(2,3,2,5) implies an average initiation rate of one task every 3 time
units. If there is no storage in the pipeline then the output rate
equals the input rate once the first output occurs. Thus in a pipeline
without buffers, the throughput is just the reciprocal of the average
latency of the cycle.
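
These three quantities follow directly from the cycle. The sketch below uses exact fractions so that the throughput is not rounded; the function name is an illustrative assumption.

```python
from fractions import Fraction


def cycle_stats(cycle):
    """Return (period, average latency, throughput) of an initiation cycle
    for a pipeline without buffers."""
    period = sum(cycle)
    average_latency = Fraction(period, len(cycle))
    throughput = 1 / average_latency  # tasks per time unit
    return period, average_latency, throughput


period, average_latency, throughput = cycle_stats([2, 3, 2, 5])
print(period, average_latency, throughput)
```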

Segment  utilization  is  defined  in  terms  of  the  fraction  of 
the  time  the  segment  remains  busy.  Clearly,  the  maximum  possible 
utilization  of  any  segment  is  100%. 

In a single function pipeline, if m is the maximum number of
X's in any row of the reservation table, then there is a segment which
remains busy for m time units for every task initiated. Thus no more
than one task every m time units can be processed by the pipeline, since
an average latency of m implies 100% utilization of that segment.
Thus, m is called the lower bound latency of the pipeline.
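
The lower bound latency is simply the largest row count of the reservation table. As a sketch, the table below is the one implied by the segment times quoted for Figure 1.1.1a (3, 1, 3 and 2 time units); the dictionary layout, with segments numbered from 0, is an assumption made for illustration.

```python
def lower_bound_latency(table):
    """Return m, the maximum number of reservations in any row of the table."""
    return max(len(cols) for cols in table.values())


# Segment index -> reserved columns, for segment times of 3, 1, 3 and 2 units.
table = {0: [0, 1, 2], 1: [3], 2: [4, 5, 6], 3: [7, 8]}

m = lower_bound_latency(table)
time_unit_ns = 200
print(m, m * time_unit_ns)
```

Here m is 3 time units, i.e. 600 ns, which agrees with the rate of one instruction every 600 ns stated in Section 1.1.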

A single function pipeline is called a straight through
pipeline if a task is processed by each segment exactly once.
Figure 1.1.1 is an example of such a pipeline. Characteristic of such
a pipeline is the absence of any feedback or sharing.


1.3.  Review  of  Related  Work 

The pipeline form considered here was first proposed by
Davidson [DAV71,74,75]. He has analyzed single function pipelines in
detail. Subsequently, Shar [SHA72] and Winslow [WIN73] have analyzed
exclusively single function pipelines and Thomas [THO74] has
analyzed exclusively multifunction pipelines. Mayeda [MAY75] has
suggested an interesting approach using graph theory. They all have
one thing in common: they have concentrated their efforts on finding
allowable initiation cycles with the lowest possible average latency for a
given pipeline. Davidson gave a branch-and-bound algorithm for finding
such cycles for a given pipeline. As yet no improved algorithms have
been found.

A simple control mechanism for collision-free task initiations
was devised by Davidson. In a single function pipeline, this control
scheme requires one shift register. Davidson's shift-register
controller initiates a waiting task as soon as allowed by the pipeline;
that is, as soon as a collision-free flow can be guaranteed through the
pipeline. This initiation strategy is called greedy. The greedy strategy
minimizes the current latency rather than the average latency in the
long run. Shar and Winslow have studied the greedy initiation strategy
in great detail [SHA72, WIN73]. Any cycle which results from a greedy
initiation sequence is called a greedy cycle. For a single function
pipeline the following inequalities are satisfied:

(maximum no. of X's in any row of the reservation table)
    ≤ (minimum average latency of any allowable cycle)
    ≤ (average latency of any greedy cycle)
    ≤ (size of the usage interval set) + 1.
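
The greedy strategy can be imitated in software with an integer used as a shift register. This is only a sketch in the spirit of the controller described above, not a description of Davidson's hardware: the encoding (bit f-1 set when latency f is forbidden), the function name and the assumption that a task is always waiting are ours.

```python
def greedy_latencies(F, count):
    """Simulate greedy initiations for a single function pipeline with usage
    interval set F, assuming a task is always waiting. Returns `count` latencies."""
    # Bit f-1 is set when initiating f time units after an initiation collides.
    collision_bits = sum(1 << (f - 1) for f in F)
    state = collision_bits  # the first task is initiated at time 0
    latencies = []
    wait = 0
    while len(latencies) < count:
        wait += 1
        forbidden = state & 1  # would an initiation at this clock collide?
        state >>= 1            # advance the shift register by one time unit
        if not forbidden:
            latencies.append(wait)
            wait = 0
            state |= collision_bits  # account for the newly initiated task
    return latencies


F = {1, 2, 3, 5}
latencies = greedy_latencies(F, 5)
average = sum(latencies) / len(latencies)
print(latencies, average, len(F) + 1)
```

For F = {1,2,3,5} the simulation settles into the constant latency cycle (4), whose average latency of 4 lies below the bound of (size of the usage interval set) + 1 = 5.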

Multifunction pipelines are somewhat more complicated. Davidson's
shift-register controller is still applicable for greedy initiations,
but the number of shift registers required is equal to the number of
distinct functions. Moreover, the controller must also have random
access to a table of constants. The table has a number of bits proportional
to the square of the number of functions. However, the controller
is still relatively simple and inexpensive, considering the complexity
of a multifunction pipeline. Thomas has given an algorithm to find a
minimum set of "good" allowable cycles of a multifunction pipeline
[THO74]. For any given function mix, a linear combination of good
cycles exists which produces the maximum throughput possible with any
collision-free strategy.

The principal limitations of the above scheduling strategies
are as follows:

1. The assumption that the initiations be allowable is
somewhat restrictive. In many situations, there are no allowable
cycles which would utilize a pipeline to its fullest extent, i.e.,
utilize some segment 100%.

2. The requirements for flexibility in scheduling and high
throughput are contradictory. For example, greedy control in multifunction
pipelines gives sufficient flexibility in initiations.
However, some simulation studies have shown that maximum segment
utilization can easily be as low as 50%. On the other hand, if an
allowable cycle with low average latency is used then any changes in
the arrival pattern which do not conform to the cycle cannot be
tolerated.

In other related work, Larson [LAR73] describes a methodology
for designing straight through pipelines. He gives some guidelines
for optimally dividing a functional unit or resource into segments.
There also exists some work at the system level and some at the logic
level [CHE71, FLY72, HAL72, TOM67].

Other seemingly related work deals with job-shop scheduling.
In operations-research terminology our pipeline is a job-shop where
segments correspond to machines and tasks to jobs. The job-shop
scheduling algorithms, however, are inapplicable to the pipeline model
considered here for two reasons. First, job-shops normally assume
infinite internal buffers, with few exceptions [RED73], which is
impractical for computational pipelines. Second, the complexity of
the job-shop scheduling algorithm is combinatorial in the number of
tasks and segments, and thus it is practically impossible to schedule
more than a few hundred tasks, while the number of tasks (e.g.,
instructions of a computer) executed in a pipeline might run to some
millions per second. We close with a note that even though job-shop
scheduling cannot be successfully applied to pipelines, pipeline
scheduling can be applied to many practical job-shops, e.g., an assembly
line. Pipeline scheduling takes advantage of the pipeline structure
and function flow in scheduling and thus has a complexity related
primarily to the number of distinct functions rather than to the number
of tasks to be scheduled.


1.4.  Objectives  of  this  Research 

1.  To  improve  the  throughput  of  a pipeline  by  appropriate 
scheduling  and/or  modification  of  the  pipeline. 

2.  To  provide  considerable  flexibility  in  scheduling  without 
adversely  affecting  the  throughput  of  the  pipeline. 

3. To provide some methodology for designing a cost-effective
pipeline.


1.5.  An  Overview  of  this  Work 

In  Chapter  2 we  present  a methodology  for  inserting  fixed 
delays  in  a pipeline  so  that  the  modified  pipeline  allows  a given  cycle. 
We  also  present  the  allowability  characteristics  of  pipelines  and 


cycles.  In  particular,  we  present  the  structure  of  all  possible  pipe- 
lines allowed  by  a given  cycle. 

In  Chapter  3 we  remove  the  restriction  that  the  initiations  be 
allowable.  Instead  we  provide  internal  buffers  to  resolve  collisions. 

We  present  the  characteristics  of  various  priority  schemes  under  the 
assumption  that  the  input  is  periodic.  Characteristics  considered  are 
the  buffer  size,  throughput  and  execution  delay. 

We  present  some  simulation  studies  of  internally  buffered 
pipelines  in  Chapter  4.  The  major  differences  from  Chapter  3 are  the 
assumptions  that  arrivals  are  random  and  the  buffer  sizes  are  fixed. 
Moreover,  the  results  of  this  chapter  are  experimental  rather  than 
analytic. We also present the characteristics of various priority
schemes.

Chapter 5 is an attempt at providing a design methodology for
cost-effective pipelines. Here we show how the results of Chapters 2,
3 and 4 can be used in the design of a pipeline.

We summarize the main results of this research in Chapter 6.
We also present some suggestions for further research in this area.




Chapter  2 

PIPELINES  WITH  DELAYS 

2.1.  Introduction 

The  usage  intervals  of  a pipeline  determine  its  allowable 
initiation  sequences.  It  should  thus  be  possible  to  design  or  modify 
a pipeline  to  have  a specific  set  of  usage  intervals  which  would  allow 
a desired  initiation  cycle.  This  is  the  subject  matter  of  this  chapter. 
We  shall  investigate  the  allowability  characteristics  of  pipelines  and 
cycles  and  use  this  information  to  modify  a given  pipeline  to  allow  a 
given  cycle.  The  modification  is  restricted  to  the  addition  of 
noncompute  segments  which  merely  provide  some  fixed  delay  between 
selected  computation  steps. 

The  actual  physical  implementation  of  a noncompute  segment 
need  not  be  considered  since  it  does  not  alter  our  treatment.  The 
theory  developed  is  adequate  for  any  physical  implementation.  However, 
some  comments  here  on  the  use  of  a noncompute  segment  are  useful. 
Noncompute  segments  will  be  used  to  delay  both  inputs  to  computation 
steps  and  outputs  from  computation  steps.  Sometimes  it  will  be 
necessary  to  do  both  as  indicated  in  the  following  example. 

Suppose  in  the  reservation  table  of  Figure  2.1.1a  we  want  to 
delay  the  input  to  the  computation  step  in  row  0,  column  2 by  two  time 
units  and  the  input  to  the  step  in  row  2,  column  2 by  one  time  unit.  The 
resulting  reservation  table  is  shown  in  Figure  2.1.1b.  Each  d indicates 
one unit of delay which we shall refer to as an elemental delay. The
assignment of elemental delays to physical noncompute segments will be
treated later. In the absence of any specific information on
precedence we assume that all the steps in one column must be completed
before any steps in the next column are initiated. Therefore if the
steps of column 2 (of Figure 2.1.1a) are unevenly delayed, then we must
store the outputs of some steps so that all the outputs are
simultaneously available to the steps of column 3 (of Figure 2.1.1a).
These delays are shown in Figure 2.1.1c. The elemental input delays
$d_1$, $d_2$ and $d_3$ require $d_4$, $d_5$ and $d_6$ as elemental
output delays.

Figure 2.1.1. Delaying parallel computation steps in a pipeline.

Sections  2.2  through  2.9  deal  with  single  function  pipelines. 
Most  of  the  results  concerning  single  function  pipelines  can  be 
extended  to  multifunction  pipelines.  This  extension  is  made  in 
Section  2.10.  In  Sections  2.2  and  2.9  we  discuss  the  allowability 
characteristics  of  pipelines  and  cycles.  We  deal  with  constant  latency 
cycles  in  Sections  2.3  through  2.8.  Constant  latency  cycles  are 
treated in considerable detail since the discussion is more
straightforward and furthermore motivates the more complicated general case.


2.2.  The  Allowability  of  Pipelines  and  Cycles 

In  this  section  we  derive  a necessary  and  sufficient  condition 
for  the  allowability  of  a pipeline  with  respect  to  a given  cycle. 

For integer x and positive integer y let x mod y denote the
remainder when x is divided by y. For some set S of integers and
positive integer y let us define S mod y to be the set formed by
replacing each element x of S by (x mod y). We may denote S mod y as
simply $S_y$ when the context is clear. The following notation will be
used consistently throughout this chapter.

$n$ = the number of initiations in a cycle $(\ell_0, \ell_1, \ldots, \ell_{n-1})$.

$p$ = the period of a cycle $(\ell_0, \ell_1, \ldots, \ell_{n-1})$
    = $\sum_{0 \le i < n} \ell_i$.

$F$ = the set of usage intervals of a pipeline.

$G$ = the set of initiation intervals of a cycle $(\ell_0, \ell_1, \ldots, \ell_{n-1})$
    = $\{\, g \mid \exists$ integers $j, k$ where $0 \le j \le k$ and
      $g = \sum_{j \le i \le k} \ell_{(i \bmod n)} \,\}$.

$Z_p$ = the set of integers modulo $p$.

$H_p$ = the complement of the set $G \bmod p$ in $Z_p$
      = $Z_p - G_p$.

Note that $G$ is an infinite set. However, it is very easy to
generate this set from a given cycle, since $G$ is trivially related to
$G_p$.

Example 2.2.1: For cycle (2,3,2,5)

    n = 4 and p = 12.

The given initiation cycle can be written as an infinite latency
sequence,

    ⟨2,3,2,5,2,3,2,5,2,3,2,5,...⟩.

We can now write the set of initiation intervals of the above sequence
as described in Chapter 1.

    $G$ = {2,3,5,7,9,10,12,14,15,17,19,21,22,...}
    $G_{12}$ = {0,2,3,5,7,9,10}.    □
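
As an illustration of how $G_p$ can be generated from a cycle, the following Python sketch may be helpful. It is not part of the original development, and the function name `initiation_intervals_mod_p` is our own. It relies on the fact that any sum of more than n consecutive latencies equals a shorter sum plus a multiple of p, so only sums of 1 to n consecutive latencies need to be reduced modulo p.

```python
from itertools import accumulate


def initiation_intervals_mod_p(cycle):
    """Return (p, G_p) for an initiation cycle given as a sequence of latencies."""
    latencies = tuple(cycle)
    n = len(latencies)
    p = sum(latencies)
    residues = set()
    for j in range(n):
        # Running sums of 1..n consecutive latencies starting at position j,
        # wrapping around the end of the cycle.
        rotated = latencies[j:] + latencies[:j]
        residues.update(g % p for g in accumulate(rotated))
    return p, residues


p, g_p = initiation_intervals_mod_p((2, 3, 2, 5))
print(p, sorted(g_p))
```

For the cycle (2,3,2,5) this yields p = 12 and the residues 0, 2, 3, 5, 7, 9, 10, in agreement with $G_{12}$ of Example 2.2.1.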


Note that all elements of G greater than p can be generated
simply by adding all the positive multiples of p to the elements less
than or equal to p. This property is formally described below.

Property 2.2.1: (a) $0 \in G_p$ and $qp \in G$ $\forall q \ge 1$, always.
(b) if $g \ne 0$ then $g \in G_p \Rightarrow g + qp \in G$ $\forall q \ge 0$.

Proof: a. The initiation cycle has a period equal to p. Therefore if
a task is initiated at time t then another task is initiated at t+p.
Therefore the initiation interval $(t+p) - t \in G$. Carrying the same
argument further, other tasks are initiated at t+2p, t+3p, ....
Therefore the initiation intervals $p, 2p, 3p, \ldots \in G$, or
$qp \in G$ $\forall q \ge 1$. It immediately follows that
$(qp \bmod p) \in G_p$; that is, $0 \in G_p$.

b. If $g \in G_p$ then $\exists$ some integer $r \ge 0$ such that
$g + rp \in G$. In other words, there exist two tasks which are
initiated g+rp units apart, say at time t and t+g+rp. Then again by
the same argument as in (a) above, tasks are also initiated at
t+p, t+2p, ..., t+rp, ... and t+g+rp, t+g+(r+1)p, t+g+(r+2)p, ....
The tasks initiated at time t+rp and t+g+rp are really two different
tasks because $g \ne 0$. Therefore the intervals
$(t+g+rp) - (t+rp)$, $(t+g+(r+1)p) - (t+rp)$, $\ldots \in G$. Thus
$g + qp \in G$ $\forall q \ge 0$. Q.E.D.

Property 2.2.2: If $g \ne 0$ then

    $g \in G_p \Leftrightarrow (p-g) \in G_p$.

Proof: By Property 2.2.1b, $g \in G_p \Rightarrow g \in G$. Then there
are two tasks which are initiated g time units apart. Suppose the two
tasks are initiated at time t and t+g. By the periodicity of the
initiations, there is also a task which is initiated at time t+p. Thus
the interval $(t+p) - (t+g) \in G$. Therefore

    $g \in G_p \Rightarrow (p-g) \in G$
               $\Rightarrow (p-g) \bmod p \in G_p$.

Since $1 \le g \le (p-1)$, $(p-g) \bmod p = p-g$ and thus
$(p-g) \in G_p$. The reverse implication is also true, since
$(p-g) \ne 0$ and $p - (p-g) = g$. Therefore
$g \in G_p \Leftrightarrow (p-g) \in G_p$. Q.E.D.

The same property as above also holds for the set $H_p$, the
complement of $G_p$ in $Z_p$, as shown below.

Property 2.2.3: $h \in H_p \Leftrightarrow (p-h) \in H_p$.

Proof: First note that $h \ne 0$ since by Property 2.2.1a $0 \in G_p$
always, thus $1 \le h \le (p-1)$. Suppose $h \in H_p$ but
$(p-h) \notin H_p$; then $(p-h) \in G_p$. By Property 2.2.2, this
implies $h \in G_p$, a contradiction since $h \in H_p$. Therefore

    $h \in H_p \Rightarrow (p-h) \in H_p$.

And since $p - (p-h) = h$, we have

    $h \in H_p \Leftrightarrow (p-h) \in H_p$.

Q.E.D.
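
Properties 2.2.2 and 2.2.3 can be spot-checked numerically. The sketch below is only an illustration under our own naming (`g_mod_p`, `is_symmetric`); it computes $G_p$ and $H_p$ for a cycle and tests that each set is closed under the map $r \mapsto (p - r) \bmod p$. It is a check on examples, not a substitute for the proofs.

```python
def g_mod_p(cycle):
    """Return (p, G_p): the period and the initiation intervals modulo p."""
    n = len(cycle)
    p = sum(cycle)
    residues = set()
    for j in range(n):
        total = 0
        for k in range(n):
            # Sum of k+1 consecutive latencies starting at position j.
            total += cycle[(j + k) % n]
            residues.add(total % p)
    return p, residues


def is_symmetric(residues, p):
    """True if r in residues implies (p - r) mod p in residues."""
    return all((p - r) % p in residues for r in residues)


p, g_p = g_mod_p((2, 3, 2, 5))
h_p = set(range(p)) - g_p
print(sorted(h_p), is_symmetric(g_p, p), is_symmetric(h_p, p))
```

For the cycle (2,3,2,5) of Example 2.2.1 both checks succeed, and $H_{12}$ comes out as {1,4,6,8,11}.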


The  following  theorem  establishes  a necessary  and  sufficient 
condition  for  allowability. 

Theorem 2.2.1: A cycle with period p and initiation interval set G
is allowed by a pipeline with usage interval set F, if and only if

    $F_p \cap G_p = \emptyset$.

Proof: A cycle is allowable if and only if no collision occurs when
the cycle is followed. A collision between two tasks occurs if and
only if their initiation interval is the same as one of the usage
intervals of the pipeline. Thus the cycle is allowable if and only if

    $F \cap G = \emptyset$, where $\emptyset$ is the empty set.

To complete the proof it is sufficient to show that

    $F \cap G = \emptyset \Leftrightarrow F_p \cap G_p = \emptyset$

or, equivalently

    $F \cap G \ne \emptyset \Leftrightarrow F_p \cap G_p \ne \emptyset$.

(i) Suppose $F \cap G \ne \emptyset$; then for some $x \in (F \cap G)$

    $x \in F \Rightarrow (x \bmod p) \in F_p$
    $x \in G \Rightarrow (x \bmod p) \in G_p$

and therefore $F_p \cap G_p \ne \emptyset$.

(ii) To prove the backward implication, suppose
$F_p \cap G_p \ne \emptyset$; then for some $x \in F_p \cap G_p$,
if $x \ne 0$ then

    $x \in F_p \Rightarrow x + jp \in F$ for some integer $j \ge 0$
    $x \in G_p \Rightarrow x + ip \in G$ for all integers $i \ge 0$ (by Property 2.2.1).

Therefore

    $x + jp \in F \cap G$.

Similarly, if $x = 0$ then

    $x \in F_p \Rightarrow x + jp \in F$ for some integer $j \ge 1$
    $x \in G_p \Rightarrow x + ip \in G$ for all integers $i \ge 1$ (by Property 2.2.1).

Therefore

    $x + jp \in F \cap G$.

Thus in either case

    $F \cap G \ne \emptyset$.

Therefore

    $F \cap G \ne \emptyset \Leftrightarrow F_p \cap G_p \ne \emptyset$.

Q.E.D.


An equivalent to the following corollary appeared in [SHA72].

Corollary 2.2.1.1: A constant latency cycle $(\ell)$ is allowed by a
pipeline if and only if no usage interval of the pipeline is an integral
multiple of $\ell$.

Proof: Immediate, because period $p = \ell$ and $G_p = \{0\}$. Cycle
$(\ell)$ is allowable if and only if
$(F \bmod \ell) \cap \{0\} = \emptyset$. Q.E.D.

From the above theorem we may attach the following meaning
to the set $H_p$, the complement set of $G_p$. A cycle is allowable if
and only if $F_p \cap G_p = \emptyset$, or equivalently if and only if
$F_p \subseteq H_p$. Hence $H_p$ is the set of allowable usage
intervals modulo p with respect to the given cycle.
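
The test of Theorem 2.2.1 is easy to mechanize. The following Python sketch is an illustration only: the names `residues_of_cycle` and `is_allowed` are ours, and the usage interval sets used in the example are hypothetical rather than taken from any figure of this report. The set F is assumed to contain only the positive distances between X's in a row.

```python
def residues_of_cycle(cycle):
    """Return (p, G_p) for a cycle given as a sequence of latencies."""
    n = len(cycle)
    p = sum(cycle)
    residues = set()
    for j in range(n):
        total = 0
        for k in range(n):
            total += cycle[(j + k) % n]
            residues.add(total % p)
    return p, residues


def is_allowed(usage_intervals, cycle):
    """Theorem 2.2.1: the cycle is allowed iff F_p and G_p are disjoint."""
    p, g_p = residues_of_cycle(cycle)
    f_p = {f % p for f in usage_intervals}
    return f_p.isdisjoint(g_p)


# Hypothetical usage interval sets, chosen only to exercise the test.
print(is_allowed({1, 6}, (2, 3, 2, 5)))        # F_p lies inside H_12
print(is_allowed({1, 3, 5, 6}, (2, 3, 2, 5)))  # 3 and 5 are in G_12
print(is_allowed({1, 3, 5, 6}, (4,)))          # Corollary 2.2.1.1 with latency 4
```

For a constant latency cycle the function reduces to the check of Corollary 2.2.1.1, since $G_p$ is then {0}.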


2.3.  Constant  Latency  Cycles 

Sections 2.3 through 2.8 deal exclusively with constant
latency cycles. Given a constant latency cycle, it is possible to
modify a pipeline by replicating some segments such that the pipeline
becomes allowable. For example one can always replicate the segments
a sufficient number of times such that no segment is used more
than one time unit per task, making the pipeline allowable for any
cycle. But it is not apparent when one can make a pipeline allowable
simply by introducing some fixed delays. We address this problem in
the following theorem. A theorem similar to this was proven by Shar
[SHA72]. We shall see later in Section 2.9 that there exists a
somewhat similar but more restrictive statement concerning arbitrary
cycles.

Theorem 2.3.1: For a given constant latency cycle $(\ell)$, any
pipeline can be reconfigured using noncompute segments such that the
pipeline is allowable, if and only if the lower bound m of the pipeline
is less than or equal to $\ell$.

Proof: (i) First assume $m \le \ell$. By Corollary 2.2.1.1, a constant
latency cycle $(\ell)$ is allowable if no usage interval (i.e., distance
between any two X's in any row of the reservation table of the
pipeline) is an integral multiple of $\ell$. For a particular row the
distance between an X in the i-th column and an X in the j-th column is
$|j-i|$. This distance is a multiple of $\ell$ if and only if
$i \equiv j \pmod{\ell}$.

Let us relabel each column by its modulo $\ell$ representative,
that is, we relabel a column marked i with $(i \bmod \ell)$. Now if we
make certain that no two X's in a row have the same column-label then
we have ensured that no usage interval is an integral multiple of
$\ell$. We give the following construction to show that this can always
be done by delaying some computation steps.

Scan the columns of the relabeled reservation table from
left to right. Then scan each column from top to bottom. As each X
is found, it may be moved to the right to ensure that its
column-label is different from the column-labels of all previously
found X's in that row. This can always be done since there are $\ell$
distinct labels, but there are no more than m X's in a row and
$m \le \ell$. Moving an X to the right by k columns is equivalent to
delaying that computation step by k time units. (If any X's in the
column being processed are moved then the unscanned columns of the
reservation table are moved to the right to preserve the order of
computation steps as implied by the original table.)

(ii) The reverse implication is straightforward. From the definition
of the lower bound m, no pipeline can ever be allowable by insertion
of delays if $m > \ell$. Q.E.D.
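
The construction used in part (i) of the proof can be sketched in Python as follows. This is our own illustrative rendering, not the report's: the function name `insert_delays`, the representation of a reservation table as a list of sets of column indices, and the small example table are all assumptions. After a column is processed, the unscanned columns are shifted right by the largest move made in that column, which is how we read the parenthetical remark in the proof.

```python
def insert_delays(table, latency):
    """Greedy construction from the proof of Theorem 2.3.1.

    table   -- list of rows; each row is a set of column indices holding an X
    latency -- the constant latency of the cycle
    Returns (new_rows, added_delay): the new column of every X, row by row,
    and the total number of time units added to the execution time.
    """
    n_cols = 1 + max(col for row in table for col in row)
    used = [set() for _ in table]      # column-labels (mod latency) taken per row
    new_rows = [[] for _ in table]
    shift = 0                          # accumulated shift of unscanned columns
    for col in range(n_cols):          # left to right
        col_delay = 0
        for r, row in enumerate(table):    # top to bottom
            if col not in row:
                continue
            if len(used[r]) >= latency:
                raise ValueError("row %d has more than %d X's" % (r, latency))
            position = col + shift
            k = 0
            while (position + k) % latency in used[r]:
                k += 1                 # move the X one more column to the right
            used[r].add((position + k) % latency)
            new_rows[r].append(position + k)
            col_delay = max(col_delay, k)
        shift += col_delay             # later columns keep their order
    return new_rows, shift


# Hypothetical three-segment table with at most 3 X's per row.
example = [{0, 5, 6}, {1, 4, 7}, {2, 3}]
print(insert_delays(example, 3))
```

As the proof notes, the construction only demonstrates existence; it makes no attempt to keep the added delay small.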

Figure 2.3.1 illustrates the use of the construction
described in the proof of the above theorem. Figure 2.3.1a is the
original reservation table and Figure 2.3.1b is the reconfigured table
for constant latency $\ell = 4$. Figure 2.3.1c shows the positions of
elemental delays marked as "d." The connectors between d and X
indicate whether the input or the output of that step is delayed.
For consistency, we shall write the elemental delay at the input side
of the computation step except when it is necessary to use delay at the
output side also, as was described in Section 2.1.

If one noncompute segment is used for each elemental delay
then the reservation table in Figure 2.3.1d results, where $D_1$ through
$D_{10}$ are noncompute segments. However, some of the elemental delays
can be provided by a single physical noncompute segment. This resource
sharing is discussed in the following section.


2.4.  Sharing  of  Noncompute  Segments 

First we shall consider the sharing with some restriction.
We assume that a noncompute segment cannot be shared by more than one
elemental delay during a single time unit and that an elemental delay
may not use more than one noncompute segment.

Note  that  datapath  width  may  vary  from  one  computation  step 
to  another  and  therefore  the  datapath  widths  of  elemental  delays  also 
may  vary.  However,  this  does  not  prevent  several  elemental  delays  from 
sharing  a single  noncompute  segment,  as  long  as  the  datapath  width  of 
the  noncompute  segment  is  greater  than  or  equal  to  that  of  each  of  the 
elemental delays. In order to separate the issues involved, our only
concern now is how sharing can be done without affecting the
allowability of the pipeline for the given cycle. Path width and
"bit-segments" are considered later in this section.

Let $(d_1, d_2, \ldots, d_k)$ be the set of elemental delays ordered
arbitrarily in some reconfigured reservation table, and let
$(t_1, t_2, \ldots, t_k)$ be their corresponding set of column numbers.
That is, $d_i$ is located in the column labeled $t_i$ of the
reconfigured reservation table, where the columns are indexed
0,1,2,... from left to right in increasing order.

For example the reconfigured table in Figure 2.3.1c has
$(d_1, d_2, \ldots, d_{10})$ as the set of elemental delays and
(4,8,9,14,11,13,9,10,11,12) as the corresponding ordered set of column
numbers.

A single noncompute segment can be shared by elemental delays
$d_i$ and $d_j$ without affecting the allowability of a constant latency
cycle $(\ell)$, if and only if $d_i$ and $d_j$ are not an integral
multiple of $\ell$ apart (by Corollary 2.2.1.1). That is,
$|t_i - t_j| \bmod \ell \ne 0$. If $|t_i - t_j| \bmod \ell = 0$ then
$d_i$ and $d_j$ cannot share the same noncompute segment. The relation
"incompatible with" defined by "$d_i$ incompatible with $d_j$ if
$t_i \equiv t_j \pmod{\ell}$" is an equivalence relation on the set
$(d_1, d_2, \ldots, d_k)$. This relation partitions the set of delay
elements into equivalence classes. We call these classes
incompatibility classes.

No two members of an incompatibility class can share a single
noncompute segment. Any two or more elemental delays, each belonging to
a distinct incompatibility class, can always share the same segment.

It follows immediately that the size of the largest
incompatibility class is the minimum number of noncompute segments
required to provide all the elemental delays. Also note that $\ell$ or
fewer elemental delays appearing in successive columns can always share
a single noncompute segment, because the greatest separation between any
two such elemental delays is less than $\ell$ and therefore no usage
interval can be a multiple of $\ell$.


Take again the example of Figure 2.3.1c, which was
reconfigured for constant latency 4. The incompatibility classes
of the set of elemental delays are

    $\{d_1, d_2, d_{10}\}$, $\{d_3, d_6, d_7\}$, $\{d_4, d_8\}$, $\{d_5, d_9\}$.

The size of the largest class is 3, therefore 3 noncompute segments are
required to provide elemental delays $d_1$ through $d_{10}$. Of many
possible sharing configurations one is shown in Figure 2.4.1 where
$S_3$, $S_4$ and $S_5$ are noncompute segments.
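
The partition into incompatibility classes and one possible sharing assignment can be sketched as follows. The code is an illustration under our own naming (`incompatibility_classes`, `assign_segments`); the assignment rule, which gives the r-th member of every class to segment r, is just one simple choice among the many configurations mentioned above and is not claimed to be the one of Figure 2.4.1.

```python
from collections import defaultdict


def incompatibility_classes(columns, latency):
    """Group delay indices 1..k by their column number modulo the latency."""
    classes = defaultdict(list)
    for index, t in enumerate(columns, start=1):
        classes[t % latency].append(index)
    return dict(classes)


def assign_segments(columns, latency):
    """Map each delay index to a segment number (0-based).

    Members of one class receive distinct segments; delays that share a
    segment therefore always come from distinct classes.
    """
    assignment = {}
    for members in incompatibility_classes(columns, latency).values():
        for segment, delay in enumerate(members):
            assignment[delay] = segment
    return assignment


columns = (4, 8, 9, 14, 11, 13, 9, 10, 11, 12)   # t_1 .. t_10 of Figure 2.3.1c
classes = incompatibility_classes(columns, 4)
print(classes)
print(max(len(members) for members in classes.values()))
print(assign_segments(columns, 4))
```

With the column numbers quoted for Figure 2.3.1c and latency 4, the grouping reproduces the four classes listed above, and the largest class has three members.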

In  some  situations  the  type  of  segment  sharing  proposed  above 
may  not  be  suitable.  One  can  for  example  easily  formulate  a "bit 
sharing"  scheme.  We  next  present  such  a scheme,  which  can  easily  be 
substituted  for  segment  sharing.  The  discussion  in  later  sections, 
however,  reverts  to  segment  sharing  since  segment  sharing  illustrates 
our  fundamental  points  and  is  notationally  far  simpler. 

Define a noncompute bit-segment to be a one bit wide
noncompute segment; similarly an elemental bit-delay is a one bit wide
elemental delay. Thus an elemental delay $d_i$ which is $w_i$ bits wide
can be thought of as being made up of $w_i$ elemental bit-delays. The
problem now reduces to that of sharing of noncompute bit-segments by
elemental bit-delays. This problem is similar to the previous problem.

If the elemental delays $d_i$ and $d_j$ are incompatible then so
are the elemental bit-delays of $d_i$ and $d_j$. The elemental
bit-delays of $d_i$ may be denoted by $d_{i1}, d_{i2}, \ldots, d_{iw_i}$.
The incompatibility classes of elemental bit-delays may be formed
directly from the incompatibility classes of elemental delays already
formed earlier. Thus for example, the class $\{d_3, d_6, d_7\}$ formed
earlier in the discussion of Figure 2.3.1d gives

    $\{d_{31}, d_{32}, \ldots, d_{3w_3}, d_{61}, d_{62}, \ldots, d_{6w_6}, d_{71}, d_{72}, \ldots, d_{7w_7}\}$.

In this manner we can form all the bit incompatibility classes.
These classes have the same properties as those of incompatibility
classes of elemental delays, namely, no two elemental bit-delays from
the same class can share a noncompute bit-segment and two or more
bit-delays each contained in a distinct incompatibility class can always
share a single noncompute bit-segment. Thus the size of the largest
bit incompatibility class is the minimum number of noncompute
bit-segments required to provide all the bit-delays.
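
Since each bit incompatibility class is the union of the bit-delays of one incompatibility class, its size is the sum of the widths $w_i$ of the members of that class. A minimal sketch of the resulting count follows; the function name `min_bit_segments` is ours and the widths in the example are hypothetical, since the report does not give datapath widths for Figure 2.3.1.

```python
def min_bit_segments(columns, widths, latency):
    """Size of the largest bit incompatibility class.

    columns -- column number t_i of each elemental delay
    widths  -- width w_i in bits of each elemental delay (same order)
    """
    class_bits = {}
    for t, w in zip(columns, widths):
        label = t % latency
        class_bits[label] = class_bits.get(label, 0) + w
    return max(class_bits.values())


columns = (4, 8, 9, 14, 11, 13, 9, 10, 11, 12)
widths = (16, 16, 8, 8, 8, 8, 8, 32, 8, 16)   # hypothetical widths in bits
print(min_bit_segments(columns, widths, 4))
```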


2.5.  Defining  the  Optimum  Solution 

The construction described in the proof of Theorem 2.3.1 is
merely an existence proof; it does not necessarily provide a 'good'
solution. This construction when used on the reservation table of
Figure 2.3.1a produced the table of Figure 2.3.1c. This solution has
10 elemental delays. If the criterion for goodness is the fewest
elemental delays then the reservation table of Figure 2.5.1 with only
two elemental delays is a better solution to the same problem. Thus
there  may  be  many  solutions  to  the  same  problem.  In  this  section  we 
define an optimum solution. In the next two sections we discuss how to
obtain such optimum solutions. Two important parameters in such a
solution are added execution time and cost.

Figure 2.5.1. An optimum solution to the problem of making the
pipeline of Figure 2.3.1a allowable with respect
to cycle (4).


The  increase  in  execution  time  is  caused  by  the  delaying  of 
some  computation  steps.  For  example  the  solution  of  Figure  2.3.1c 
takes  6 time  units  more  than  the  original  execution  time  of  10  units. 

If  the  pipeline  can  be  kept  full  by  initiating  tasks  at  the  maximum 
rate,  then  the  throughput  is  the  same  as  the  input  rate  irrespective  of 
the  added  execution  delay.  However,  in  some  situations  it  becomes 
necessary  to  empty  the  pipeline  before  a new  task  can  be  initiated. 

For  example,  consider  some  tasks  which  are  logically  dependent;  that 
is,  some  task  can  be  initiated  only  after  the  execution  of  some 
previously  initiated  task  is  completed.  Another  example  of  such 
dependency  is  conditional  branching;  that  is,  the  outcome  of  some 
operation  determines  which  one  of  two  or  more  streams  of  tasks  should 
be  initiated.  Thus  the  total  execution  time  can  become  an  important 
parameter  in  determining  the  overall  throughput.  In  such  situations 
we  would  like  to  minimize  the  added  execution  delay  when  reconfiguring 
a reservation  table  by  the  use  of  added  noncompute  segments. 

The  second  parameter  is  the  cost,  that  is,  the  cost  of  adding 
noncompute  segments.  Some  of  the  factors  which  contribute  significantly 
to  this  cost  are:  the  cost  of  the  noncompute  segments  themselves,  the 
cost  of  additional  control  and  the  cost  of  additional  data-paths.  The 
number  of  noncompute  segments  may  be  reduced  by  utilizing  the  sharing 
techniques  described  in  the  previous  section.  However,  sharing  usually 
adds more datapaths. Also it makes data routing among segments more
complex. On the other hand some types of sharing are simple and
profitable. For example, as mentioned before, $\ell$ or fewer successive
elemental delays can always share a single noncompute segment. Note
that  such  sharing  is  profitable  since  it  can  be  implemented  by  a single 
register  with  an  appropriate  clock  without  creating  any  additional 
datapaths.  Furthermore,  an  input  elemental  delay  may  be  replaced  by 
an  output  delay  of  the  preceding  computation  step  and  vice  versa. 

Though  these  configurations  are  logically  equivalent,  they  are  not 
necessarily  equivalent  in  cost.  It  is  clear  that  to  compute  the  cost 
difference  between  any  two  configurations  of  a pipeline  would  require 
extensive  description  of  all  datapaths  and  their  widths,  the  actual 
techniques  used  for  data  routing  (or  communication  among  segments)  and 
the  actual  implementation  of  noncompute  segments  together  with  the 
formulas for computing the cost from these parameters. A formal cost
model can be formed for a specific case, but it would be very
cumbersome and of doubtful accuracy in the general case.

From now onwards we shall concern ourselves only with
minimizing execution time, and an optimum solution will be defined to
be a solution having minimum execution delay.

The  algorithm  to  be  presented  in  Section  2.7,  is  capable  of 
producing  all  the  solutions  having  minimum  execution  delay.  Also  it 
is  possible  to  produce  solutions  having  execution  delay  below  a given 
bound.  A designer  can  then  choose  a solution  out  of  these  to  optimize 
secondary  objectives,  such  as  cost. 


2.6.  Formulation  of  the  Optimization  Problem 

The problem of making a pipeline allowable for some constant
latency cycle $(\ell)$ in an optimal fashion can be formally stated as:

    Minimize added execution delay subject to
    $(F \bmod p) \cap \{0\} = \emptyset$

where F is the set of usage intervals of the reconfigured reservation
table and p is the period of the cycle and is equal to $\ell$. This
constraint comes directly from Corollary 2.2.1.1.

The following two examples show how we can formulate the
problem for a given reservation table. First, some definitions.

Let D be the added execution delay and let $d_{ij}$ and $d'_{ij}$ be
the number of elemental delays to be inserted respectively at the input
and the output of an X originally in row i and column j (referred to as
cell (i,j)) of the reservation table. If no X occurs in cell (i,j) of
the table then $d_{ij}$ and $d'_{ij}$ are defined to be zero.

Example 2.6.1: For the reservation table of Figure 2.6.1 we can
express the added execution delay D and the constraints as follows.
We need not include those variables $d_{ij}$ and $d'_{ij}$ which are
zero by definition.

    $D = d_{00} + d_{11} + d_{22} + d_{23} + d_{14} + d_{05} + d_{06} + d_{17}$

and the constraints are

    integer $d_{ij} \ge 0$

and

    1. $(d_{11} + d_{22} + d_{23} + d_{14} + d_{05} + 5) \bmod p \ne 0$            $\langle X_{00}, X_{05} \rangle$
    2. $(d_{06} + 1) \bmod p \ne 0$                                                $\langle X_{05}, X_{06} \rangle$
    3. $(d_{11} + d_{22} + d_{23} + d_{14} + d_{05} + d_{06} + 6) \bmod p \ne 0$   $\langle X_{00}, X_{06} \rangle$
    4. $(d_{22} + d_{23} + d_{14} + 3) \bmod p \ne 0$                              $\langle X_{11}, X_{14} \rangle$
    5. $(d_{05} + d_{06} + d_{17} + 3) \bmod p \ne 0$                              $\langle X_{14}, X_{17} \rangle$
    6. $(d_{22} + d_{23} + d_{14} + d_{05} + d_{06} + d_{17} + 6) \bmod p \ne 0$   $\langle X_{11}, X_{17} \rangle$
    7. $(d_{23} + 1) \bmod p \ne 0$                                                $\langle X_{22}, X_{23} \rangle$

The pair of X's in the triangular brackets indicates that the
constraint on the left resulted from that particular pair.    □

Notice that we did not use the variable $d'_{ij}$ in any of the
above constraints. This is because the reservation table did not have
more than one X in any column and therefore the output of a step is
identical to the input of the next step and output delays are not
required. The next example is more general and we will need to
insert delays at the output of some computation steps. Output delays
were discussed in Section 2.1.

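           0  1  2  3  4  5  6  7
    S0     X              X  X
    S1        X        X        X
    S2           X  X

Figure 2.6.1. Reservation table for Example 2.6.1.

Example 2.6.2: For the reservation table of Figure 2.6.2 we can express
the added execution delay D and the constraints as follows. Again we
shall omit those variables $d_{ij}$ and $d'_{ij}$ which are zero by
definition.

    $D = \max\{d_{00}, d_{10}, d_{20}\} + d_{01} + d_{22} + \max\{d_{13}, d_{23}\} + \max\{d_{04}, d_{14}\}$.

The constraints are

    integer $d_{ij} \ge 0$

and

    1. $(d'_{00} + d_{01} + 1) \bmod p \ne 0$                                              $\langle X_{00}, X_{01} \rangle$
    2. $(d_{22} + \max\{d_{13}, d_{23}\} + d_{04} + 3) \bmod p \ne 0$                      $\langle X_{01}, X_{04} \rangle$
    3. $(d'_{00} + d_{01} + d_{22} + \max\{d_{13}, d_{23}\} + d_{04} + 4) \bmod p \ne 0$   $\langle X_{00}, X_{04} \rangle$
    4. $(d'_{10} + d_{01} + d_{22} + d_{13} + 3) \bmod p \ne 0$                            $\langle X_{10}, X_{13} \rangle$
    5. $(d'_{13} + d_{14} + 1) \bmod p \ne 0$                                              $\langle X_{13}, X_{14} \rangle$
    6. $(d'_{10} + d_{01} + d_{22} + \max\{d_{13}, d_{23}\} + d_{14} + 4) \bmod p \ne 0$   $\langle X_{10}, X_{14} \rangle$
    7. $(d'_{20} + d_{01} + d_{22} + 2) \bmod p \ne 0$                                     $\langle X_{20}, X_{22} \rangle$
    8. $(d_{23} + 1) \bmod p \ne 0$                                                        $\langle X_{22}, X_{23} \rangle$
    9. $(d'_{20} + d_{01} + d_{22} + d_{23} + 3) \bmod p \ne 0$                            $\langle X_{20}, X_{23} \rangle$

where

    $d'_{00} = \max(d_{00}, d_{10}, d_{20}) - d_{00}$
    $d'_{10} = \max(d_{00}, d_{10}, d_{20}) - d_{10}$
    $d'_{20} = \max(d_{00}, d_{10}, d_{20}) - d_{20}$
    $d'_{13} = \max(d_{13}, d_{23}) - d_{13}$.    □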

In general we can express the added execution delay D and
the constraints in the following manner. Let I be the number of rows
and J be the number of columns in the given reservation table, then

    $D = \sum_{0 \le j < J} \max_{0 \le i < I} (d_{ij})$    (2.6.1)

and the constraints are

    integer $d_{ij} \ge 0$,

    $\bigl( d'_{ab} + \sum_{b < j < c} \max_{0 \le i < I} (d_{ij}) + d_{ac} + (c-b) \bigr) \bmod p \ne 0$

    for each pair $(X_{ab}, X_{ac})$ with $c > b$,    (2.6.2)

where

    $d'_{ab} = \max_{0 \le i < I} (d_{ib}) - d_{ab}$.    (2.6.3)

If there is no X in the cell (i,j) of the reservation table
then the variable $d_{ij}$ should be assigned the value zero in all of
the above expressions.

In the constraints the term $d'_{ab}$ is the number of elemental
delays introduced at the output of computation step $X_{ab}$. The
variable $d'_{ab}$ need not appear explicitly in the constraints because
it is expressible in terms of the input delays. The variable $d_{ac}$ is
the number of input elemental delays at step $X_{ac}$. The input delay
plus the output delay for column j is given by $\max_i d_{ij}$.
Therefore, the sum term in each constraint is the effect of inserted
delays in the intervening columns between $X_{ab}$ and $X_{ac}$. Finally
the term (c-b) is the separation which existed between $X_{ab}$ and
$X_{ac}$ before the insertion of any delays. The objective function is
simply the added column delay summed over all columns.
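
Expressions (2.6.1) through (2.6.3) translate directly into a small checker. The Python sketch below is illustrative only: the function names are ours, a reservation table is represented as a set of (row, column) cells, and the example cells are those implied by the objective function of Example 2.6.1. The particular delay values tried at the end are our own choice.

```python
def column_max(cells, delays, n_cols):
    """max over i of d_ij for every column j (0 where the column has no delay)."""
    col_max = [0] * n_cols
    for (i, j) in cells:
        col_max[j] = max(col_max[j], delays.get((i, j), 0))
    return col_max


def added_delay(cells, delays, n_cols):
    """Objective function D of (2.6.1)."""
    return sum(column_max(cells, delays, n_cols))


def violated_pairs(cells, delays, n_cols, p):
    """Pairs (X_ab, X_ac) that violate constraint (2.6.2) for period p."""
    col_max = column_max(cells, delays, n_cols)
    bad = []
    for (a, b) in cells:
        for (a2, c) in cells:
            if a2 != a or c <= b:
                continue
            out_delay = col_max[b] - delays.get((a, b), 0)   # d'_ab, (2.6.3)
            between = sum(col_max[b + 1:c])                  # columns b < j < c
            total = out_delay + between + delays.get((a, c), 0) + (c - b)
            if total % p == 0:
                bad.append(((a, b), (a, c)))
    return sorted(bad)


cells = {(0, 0), (1, 1), (2, 2), (2, 3), (1, 4), (0, 5), (0, 6), (1, 7)}
print(violated_pairs(cells, {}, 8, 3))                 # no delays inserted
trial = {(1, 4): 1, (0, 5): 1}                         # hypothetical choice
print(added_delay(cells, trial, 8), violated_pairs(cells, trial, 8, 3))
```

With no delays and p = 3, the pairs whose separation is 3 or 6 are reported as violations; the trial assignment removes them at a cost of D = 2.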

While forming the constraints, those variables must be set to
zero where it is not possible to introduce a delay. Take for example the
table in Figure 2.6.1. In the row of one of the segments there are two
successive X's. These two X's might have resulted from a single
indivisible computation step which is 2 units long. In such a case the
input delay variable of the second X would have to be set to zero to
preclude possible preemption of an indivisible computation step.
Setting variables to zero in this manner does not destroy the existence
of a solution. This should become obvious when we discuss allowable
rows of a cycle in Section 2.9.


2.7.  Obtaining  the  Optimum  Solution 

There is no analytic solution to the optimization problem
presented in the previous section, except for a special but important
class of pipelines presented in the next section. The solution space
is finite because each $d_{ij}$ need only take integer values between 0
and p-1, as all the constraints are given in modulo p arithmetic. This
places an upper bound on the added execution time equal to (p-1)·J,
where J is the number of columns in the reservation table. Moreover
the objective function D is nondecreasing in $d_{ij}$; that is, any
addition of an elemental delay either increases the execution delay or
leaves it unchanged. These properties allow a branch-and-bound search
technique. We present below such a branch-and-bound algorithm.

Algorithm 2.7.1: Let the number of X's in the reservation table be x
and let the x variables, $d_{ij}$, be stored in any arbitrary order in a
one dimensional array V. Let D(i) represent the value of the objective
function for given values of V(1) through V(i), with V(i+1) through
V(x) taken to be 0.

B1. [Initialize] i ← 0; BOUND ← (p-1)·J;

B2. [Advance] i ← i+1; V(i) ← 0;

B3. [Check bounds and constraints] if (V(i) = p) or (D(i) > BOUND) then
go to B6; if a completely assigned constraint is violated then go
to B5;

B4. [Solution found?] if i < x then go to B2 else output the solution
V(1) through V(x) and D(x); BOUND ← D(x);

B5. [Try another value] V(i) ← V(i)+1; go to B3;

B6. [Backtrack] i ← i-1; if i > 0 then go to B5 else terminate the
algorithm. □

The last value of BOUND is the minimum value of the objective
function over all possible solutions and therefore the output solutions
meeting this bound are all the minimum added delay solutions.
Explanation of some of the steps in the above is as follows.

In step B3, if V(i) = p then we have assigned all possible
values to V(i). If the current value of the objective function exceeds
BOUND then we need not proceed since the objective function can only
increase further. In either case we must backtrack. If a completely
assigned constraint is violated then the current value of V(i) is not
acceptable and therefore we must try another value. Note that the
constraints which consist only of variables V(1) through the current
variable V(i) are completely assigned. Out of these we need to check
only those constraints which actually have V(i) as a variable, since
the other constraints would have been verified previously.

In step B4, if i < x then there are still more variables to be
assigned some values. If i = x then all variables have been assigned
successfully without violating a constraint or bound and therefore we
have a solution.

In step B6, while backtracking, if no variables are left
then all possible solutions have been either fully developed or
rejected for exceeding BOUND or violating constraints. We thus
terminate the algorithm.

The above algorithm generates all the minimum delay solutions.
If only one optimum solution is desired then the condition D(i) > BOUND
in step B3 should be changed to D(i) ≥ BOUND.

Another possibility is that we start the algorithm with some
specific value for BOUND and do not update it later, that is, remove
the statement BOUND ← D(x) in step B4. Then the algorithm generates
all solutions which have their objective function value less than or
equal to the fixed value of BOUND. This version would be particularly
useful if a designer has some secondary objectives for his solutions.

A complete example with the constraints and the backtrack
tree is given in Figure 2.7.1. The variables, $d_{ij}$, have been retained
in the figure for simplicity. B is the variable BOUND, '>B'
indicates that the bound has been exceeded, and 'V' indicates that the
constraint (a) has been violated.

Algorithm 2.7.1 is remarkably efficient in our limited
experience. For example, for one 20 variable problem with a potential
$10^{14}$ nodes in the tree, only $10^4$ nodes were expanded and an optimum
solution was obtained in 40 seconds on an IBM 360/67. For a particular
class of problems, the technique of Knuth [KNU75] may be applicable to
estimate the complexity of the algorithm.
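
The search of Algorithm 2.7.1 can be sketched in Python as follows. This
is a minimal recursive restatement of steps B1 through B6, not the
original implementation: the function name, the representation of a
constraint as a pair (indices, predicate), and the final filtering of
the output are assumptions made for illustration. The example data are
the constraints of Figure 2.7.1 as far as they can be read, with
constraint (ii) assumed to end in the term $d_{11}$.

```python
def branch_and_bound(x, p, J, objective, constraints):
    """Recursive sketch of steps B1-B6 (hypothetical helper, not from the report).

    objective(V) -> D for a full-length list V; unassigned entries are 0.
    constraints  -> list of (indices, predicate); a constraint is completely
                    assigned once every index in `indices` is <= the current i.
    """
    V = [0] * x
    bound = (p - 1) * J                       # B1
    found = []                                # every solution output in B4

    def search(i):
        nonlocal bound
        if i == x:                            # B4: all variables assigned
            d = objective(V)
            found.append((d, tuple(V)))
            bound = d
            return
        for value in range(p):                # B2 / B5: try 0 .. p-1
            V[i] = value
            if objective(V) > bound:          # B3: D(i) > BOUND, backtrack
                break
            # B3: only constraints whose last variable is V[i] need checking
            if all(pred(V) for idx, pred in constraints if max(idx) == i):
                search(i + 1)
        V[i] = 0                              # B6: V[i] is again "taken to be 0"

    search(0)
    optima = [v for d, v in found if d == bound]
    return bound, optima


# Example of Figure 2.7.1: V = [d00, d10, d11, d02], cycle (2), p = 2, J = 3.
def added_delay(V):
    d00, d10, d11, d02 = V
    return max(d00, d10) + d11 + d02

example_constraints = [
    ((0, 1, 2, 3), lambda V: (2 + max(V[0], V[1]) - V[0] + V[2] + V[3]) % 2 == 1),
    ((0, 1, 2),    lambda V: (1 + max(V[0], V[1]) - V[1] + V[2]) % 2 == 1),
]

best, optima = branch_and_bound(4, 2, 3, added_delay, example_constraints)
print(best, optima)
```

The early `break` relies on the property stated above that D is
nondecreasing in every variable, so larger values of V(i) cannot bring
D(i) back under BOUND.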


Make the pipeline allowable for cycle (2): H mod 2 = {1}

Reservation table:

           0   1   2
    S0     X       X
    S1     X   X

Added delay:

    $D = \max\{d_{00}, d_{10}\} + d_{11} + d_{02}$

Constraints:

    (i)  $[2 + \max\{d_{00}, d_{10}\} - d_{00} + d_{11} + d_{02}] \bmod 2 \in \{1\}$
    (ii) $[1 + \max\{d_{00}, d_{10}\} - d_{10} + d_{11}] \bmod 2 \in \{1\}$

Initial bound: B ← 3. [Backtrack tree not reproduced.]

Optimum solutions are:

    1. $d_{00} = d_{10} = d_{11} = 0$, $d_{02} = 1$
    2. $d_{00} = d_{11} = d_{02} = 0$, $d_{10} = 1$

Figure 2.7.1. A branch-and-bound search for optimum solutions


2.8.  Looping  Pipelines 

A looping  pipeline  is  characterized  by  a reservation  table 
which  reserves  the  segments  for  one  time  unit  each  in  some  sequence 
and  then  repeats  this  sequence  several  times.  This  special  class  of 
pipelines  is  of  considerable  importance  as  is  indicated  by  their  common 
occurrence;  e.g.,  in  the  floating  point  multiply/divide  units  of  the 
IBM  360/91  and  the  Texas  Instruments  ASC.  Davidson  [DAV71]  and  Shar 
[SHA74]  have  shown  that  a greedy  initiation  strategy  for  this  class  of 
pipelines  always  achieves  maximum  throughput  but  possibly  only  with  a 
non-constant  latency  cycle.  Here  we  are  interested  in  using  constant 
latency  cycles. 

The  reservation  table  of  Figure  2.8.1a  is  an  example  of  a 
looping  pipeline.  This  pipeline  has  a lower  bound  latency  = 6.  The 
cycle  (1,1,16)  with  average  latency  6 is  allowable  for  this  pipeline  but 
the  constant  latency  cycle  (6)  is  not. 

All  the  segments  in  a looping  pipeline  have  identical  usage 
patterns  and  hence  each  segment  has  the  same  set  of  usage  intervals. 
This  property  simplifies  the  problem  considerably,  as  the  following 
results  indicate. 

Lemma 2.8.1.1: While making a looping pipeline allowable with respect
to a given cycle, by adding the least possible execution delay, it is
sufficient to consider only those delays which are placed between
successive iterations (such as $d_1, \ldots, d_5$ in Figure 2.8.1a).

Proof: All the rows of a reservation table of a looping pipeline
have identical usage patterns. Therefore an optimum way of inserting
delays in one row is also optimum for all other rows. Pick an
arbitrary row and find a solution which adds fewest possible delays
to this row and makes it allowable. This solution can be applied to
the whole reservation table such that the delays fall between the
iterations. This ensures that the sets of usage intervals of all rows
are changed identically. Thus all rows are made allowable optimally by
using only those delays which are placed between successive
iterations. Q.E.D.

Figure 2.8.1. Making a looping pipeline allowable with respect
to a constant latency cycle. (a) Reservation table of a looping
pipeline (columns 0-17), with $d_1, \ldots, d_5$ marking the points between
successive iterations where delays may be inserted. (b) The
reconfigured reservation table (columns 0-19, segments $S_0$ through $S_3$).

Theorem 2.8.1: For a looping pipeline with lower bound average latency
$m$ and number of segments $s$, there exists the following assignment of
values to delays $d_1, d_2, \ldots, d_{m-1}$ (see Figure 2.8.1a) so that the
cycle $(\ell)$, for $\ell \ge m$, is allowable and the added execution delay
is a minimum:

\[ d_q = d_{2q} = d_{3q} = \cdots = d_{\lceil (m-q)/q \rceil \cdot q} = 1, \qquad \text{all other } d_i = 0, \]

where $q$ is the smallest positive integer such that $\ell$ divides $qs$;
that is,

\[ q = \frac{\ell}{\gcd(\ell, s)} = \frac{\operatorname{lcm}(\ell, s)}{s}, \]

where gcd stands for greatest common divisor and lcm for least common
multiple.

Proof: We need to satisfy the following constraints in order to make
the pipeline allowable for cycle $(\ell)$.

\[
\begin{array}{l}
(d_1 + s) \bmod \ell \ne 0 \\
(d_2 + s) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_{m-1} + s) \bmod \ell \ne 0 \\
(d_1 + d_2 + 2s) \bmod \ell \ne 0 \\
(d_2 + d_3 + 2s) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_{m-2} + d_{m-1} + 2s) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_1 + d_2 + \cdots + d_k + ks) \bmod \ell \ne 0 \\
(d_2 + d_3 + \cdots + d_{k+1} + ks) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_{m-k} + \cdots + d_{m-1} + ks) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_1 + d_2 + \cdots + d_{m-1} + (m-1)s) \bmod \ell \ne 0.
\end{array}
\]

After substituting the values of the $d_i$'s as indicated in the theorem
statement, each inequality is of the following form,

\[ (i + js) \bmod \ell \ne 0 \quad \text{where } i, j \text{ are integers.} \]

At most $(q-1)$ consecutively subscripted variables are zero. Therefore,

\[ \text{if } i = 0 \text{ then } j < q. \tag{1} \]

There are exactly $\lceil (m-q)/q \rceil$ variables with value 1 and the
others have value 0, therefore

\[ i \le \lceil (m-q)/q \rceil. \]

Now $\ell \ge m$, therefore $i \le \lceil (\ell-q)/q \rceil$; however $q \mid \ell$,
thus $i \le (\ell-q)/q$, which implies

\[ iq/\ell < 1. \tag{2} \]

Now suppose one of the constraints is violated; that is, there exists
a constraint such that

\[ i + js = k\ell. \tag{3} \]

Multiplying this equation by $q/\ell$ we get

\[ iq/\ell + jqs/\ell = kq. \]

Now $iq/\ell$ must be an integer since $\ell \mid qs$. However, $iq/\ell < 1$ by
(2). Thus,

\[ iq/\ell = 0 \;\Rightarrow\; i = 0. \]

Substituting in equation (3) we have $js = k\ell$. So $\ell \mid js$, but $q$ is
the smallest positive integer such that $\ell \mid qs$. Therefore

\[ j \ge q. \]

But this is a contradiction to the fact that $j < q$ when $i = 0$ (by
equation (1)). Therefore no constraint is violated and hence the
stated assignment of $d_i$'s makes the pipeline allowable for cycle $(\ell)$.
Now to prove that this solution is also optimum, consider the following
subset of the constraints:


\[
\begin{array}{l}
(d_1 + d_2 + \cdots + d_q + qs) \bmod \ell \ne 0 \\
(d_2 + d_3 + \cdots + d_{q+1} + qs) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_k + d_{k+1} + \cdots + d_{k+q-1} + qs) \bmod \ell \ne 0 \\
\quad \vdots \\
(d_{m-q} + \cdots + d_{m-1} + qs) \bmod \ell \ne 0
\end{array}
\tag{4}
\]

This formulation assumes that $q < m$. If $q \ge m$ then the added delay is
zero and the solution is obviously optimum. Recall that $qs \bmod \ell = 0$.
Therefore to satisfy the above inequalities some variables in each
inequality must be nonzero. There are $(m-q)$ inequalities in the above
subset. Any single variable $d_i$ appears in at most $q$ of these
inequalities. Therefore at least $\lceil (m-q)/q \rceil$ variables must be
nonzero and the solution must have a total added delay of at least
$\lceil (m-q)/q \rceil$. This is exactly the case for our choice

\[ d_q = d_{2q} = \cdots = d_{\lceil (m-q)/q \rceil \cdot q} = 1, \qquad \text{other } d_i = 0. \]

Therefore the solution is optimum. Q.E.D.
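
As a small numerical check of Theorem 2.8.1, the following Python sketch
computes $q$ and the delay assignment of the theorem, tests allowability
of a row directly from its usage intervals, and compares the added delay
with an exhaustive search. The function names are hypothetical, and the
exhaustive search is only meant for small $m$; it is an illustration, not
part of the proof.

```python
from itertools import product
from math import gcd

def theorem_delays(m, s, l):
    """Return (q, d) where d[k] is the delay after iteration k, k = 1 .. m-1."""
    q = l // gcd(l, s)
    d = [0] * m                        # d[0] is unused padding
    ones = max(0, -(-(m - q) // q))    # ceil((m-q)/q), zero when q >= m
    for t in range(1, ones + 1):
        d[t * q] = 1
    return q, d

def allowable(d, s, l):
    """True if no two X's of a row are a multiple of l apart."""
    starts, t = [], 0
    for k in range(len(d)):            # iteration k+1 starts after delay d[k]
        t += d[k]
        starts.append(t)
        t += s
    return all((b - a) % l != 0
               for i, a in enumerate(starts) for b in starts[i + 1:])

def minimum_added_delay(m, s, l):
    """Exhaustive minimum over delays 0 .. l-1 (small m only)."""
    return min(sum(d) for d in product(range(l), repeat=m - 1)
               if allowable((0,) + d, s, l))

q, d = theorem_delays(6, 3, 6)         # m = 6, s = 3, cycle (6)
print(q, d, allowable(d, 3, 6), minimum_added_delay(6, 3, 6))
```

Restricting the exhaustive search to delays between 0 and $\ell - 1$ is
enough here because every constraint is stated modulo $\ell$.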

Corollary 2.8.1.1: If $\ell$ divides $ms$ then the above solution is unique.

Proof: The constraints presented in the above proof can be considered as
the constraints for making a single row allowable. In this case $d_i$
should be thought of as the number of elemental delays to be added
between the $i$th and $(i+1)$th X in that row. Obviously all the rows have
identical sets of constraints. Thus if a solution is unique for one row
then it is also unique for any other row. The only way to apply this
solution simultaneously to all the rows in the reservation table is to
insert the delays only between iterations. This in turn shows that the
complete solution is also unique. Therefore it is sufficient to show
that under the given conditions the assignment of values to $d_i$'s given
by the above theorem is unique. The proof follows.

By the statement of the above theorem, $q = \ell / \gcd(\ell, s)$. Hence

\[ \ell \mid ms \;\Rightarrow\; q \cdot \gcd(\ell, s) \mid ms \;\Rightarrow\; q \,\Big|\, m \cdot \frac{s}{\gcd(\ell, s)}. \]

But $q$ ($= \ell / \gcd(\ell, s)$) and $s / \gcd(\ell, s)$ are relatively prime.
Therefore $q \mid m$. This implies $\lceil (m-q)/q \rceil = (m-q)/q$. Then the
optimum solution given by Theorem 2.8.1 is

\[ d_q = d_{2q} = \cdots = d_{\frac{m-q}{q} \cdot q} = 1 \quad \text{and other } d_i = 0. \]

Consider again the subset (4) of inequalities used in the above proof.
There are $(m-q)/q$ variables which are 1 and each one covers exactly $q$
distinct inequalities in (4).

Since no variable appears in more than $q$ of the constraints
in (4), any other optimum assignment must have each nonzero variable
also covering $q$ distinct inequalities.

If $d_i = 1$ for some $i$, $1 \le i \le q-1$, then $d_i$ covers exactly $i$
inequalities, which is less than $q$. Therefore $d_i = 1$ is not an optimum
choice and therefore $d_q = 1$ in order to cover the first inequality. The
first $q$ inequalities are thus covered.

Suppose now $d_{q+i} = 1$ for some $i$, $1 \le i \le q-1$; then $d_{q+i}$ covers
$q$ inequalities, but out of these, $q-i$ inequalities are already covered
by $d_q$ and therefore $d_{q+i}$ covers only $i < q$ new inequalities, and thus
$d_{q+i} = 1$ is not an optimum choice. Then $d_{2q} = 1$ is needed to cover
inequality $q+1$. The first $2q$ inequalities are thus covered.

In a similar manner one can prove by finite induction that
$d_{3q} = d_{4q} = \cdots = d_{m-q} = 1$ while other $d_i$ are 0. The assignment
is thus unique and hence the complete optimum solution is also unique.
Q.E.D.


The  extreme  regularity  of  the  looping  pipelines  and  their 
added  delays  makes  the  sharing  of  noncompute  segments  simpler  to 
consider,  as  the  following  theorem  indicates. 

Theorem  2.8.2:  All  the  elemental  delays  in  the  solution  given  by 

Theorem  2.8.1  can  share  a single  noncompute  segment. 

Proof: All the elemental delays can share a single segment if no two
elemental delays are a multiple of $\ell$ apart in the reconfigured
reservation table.

Successive elemental delays in the reconfigured table are
$(qs+1)$ apart. Therefore the distance between any two elemental delays,
$d_{iq}$ and $d_{jq}$, for $j > i$ is $(j-i)(qs+1)$.

Suppose this distance is a multiple of $\ell$. Then
$(j-i)qs + (j-i) = k\ell$ for some integer $k$. Dividing by $\ell$ we get
$r + (j-i)/\ell = k$ where $r$ is some integer, because $\ell \mid qs$. Thus
$(j-i)/\ell$ must be an integer. We show a contradiction by proving that
$(j-i)/\ell$ is a fraction. In the solution of Theorem 2.8.1 only the
following delays are present:

\[ d_q, d_{2q}, \ldots, d_{\lceil (m-q)/q \rceil \cdot q}. \]

Therefore

\[ j \le \left\lceil \frac{m-q}{q} \right\rceil < \frac{m}{q} \le \frac{\ell}{q} \le \ell. \]

Thus $\ell > j > i$, which implies $(j-i)/\ell$ is a fraction. Thus no distance
between two elemental delays is a multiple of $\ell$ and therefore all
elemental delays can share a single noncompute segment. Q.E.D.

Example 2.8.1: For the looping pipeline of Figure 2.8.1a, suppose we
want to have the cycle (6) allowable. Then by Theorem 2.8.1,

\[ q = \frac{\ell}{\gcd(\ell, s)} = \frac{6}{\gcd(6, 3)} = 2, \]

and thus the delay assignments $d_2 = d_4 = 1$ and $d_1 = d_3 = d_5 = 0$ make
the pipeline allowable for constant latency 6. By Theorem 2.8.2, $d_2$
and $d_4$ can share the same noncompute segment. The solution is given in
Figure 2.8.1b where $S_3$ is the noncompute segment. Verify that no usage
interval of Figure 2.8.1b is a multiple of 6.
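
The verification asked for in Example 2.8.1 can be sketched in Python.
The layout used here is an assumption about Figure 2.8.1b: each
elemental delay is taken to occupy the column immediately after the
iteration it follows, on a single shared noncompute segment that is
labelled S3 in the code. The helper names are hypothetical.

```python
def reconfigured_table(m, s, delays):
    """Rows of the reconfigured table; delays[k] = delays after iteration k."""
    rows = {f"S{j}": [] for j in range(s)}
    rows[f"S{s}"] = []                   # assumed shared noncompute segment
    t = 0
    for k in range(1, m + 1):
        for j in range(s):               # iteration k visits S0, S1, ..., S(s-1)
            rows[f"S{j}"].append(t + j)
        t += s
        for _ in range(delays.get(k, 0)):
            rows[f"S{s}"].append(t)      # one column per elemental delay
            t += 1
    return rows

def usage_intervals(row):
    """All pairwise distances between X's of one row."""
    return {b - a for i, a in enumerate(row) for b in row[i + 1:]}

table = reconfigured_table(m=6, s=3, delays={2: 1, 4: 1})
for name, row in table.items():
    ok = all(u % 6 != 0 for u in usage_intervals(row))
    print(name, row, "allowable for cycle (6):", ok)
```

Under this layout the two elemental delays fall $qs + 1 = 7$ columns
apart, which agrees with the distance used in the proof of
Theorem 2.8.2.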


2.9.  Arbitrary  Cycles 

So  far  we  have  discussed  only  constant  latency  cycles.  In 
this  section  we  deal  with  making  a pipeline  allowable  for  any  given 
constant  or  nonconstant  latency  cycle.  We  shall  also  present  a 
generalization  of  Theorem  2.3.1.  In  the  general  case,  the  allowability 
of  a pipeline  cannot  be  judged  solely  from  the  average  latency  of 
the cycle. All we can say immediately is that for a given cycle with
average latency $\ell$, no pipeline with lower bound $m$ such that $m > \ell$ can
be made allowable. But what about the pipelines with $m \le \ell$? Take
for example the cycle (1,2,3). It has an average latency $\ell = 2$
and period $p = 6$. Let us try to find an allowable pipeline with lower
bound $m = 2$. A pipeline with usage interval set $F$ is allowed by cycle
(1,2,3) iff the set of initiation intervals $G$ of the cycle satisfies
$F_p \cap G_p = \emptyset$. Since $G_p$ for the cycle (1,2,3) is
$\{0,1,2,3,4,5\}$, $F_p$ must be empty for a pipeline to be allowable. Thus
the allowable pipeline does not have any usage intervals, that is, each
row of the reservation table has only one X. Thus the only allowable
pipeline has a lower bound $m = 1$ and thus no allowable pipeline exists
with lower bound $m = 2$.
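
The sets used in this argument can be computed mechanically. The sketch
below assumes that $G_p$ consists of all sums of consecutive latencies of
the cycle reduced modulo the period $p$, which agrees with the values
quoted in the text for the cycles (1,2,3) and (1,9); the function names
are hypothetical.

```python
def initiation_intervals_mod_p(cycle):
    """Return (p, G_p): sums of consecutive latencies, reduced mod p."""
    p = sum(cycle)
    n = len(cycle)
    G = set()
    for start in range(n):
        total = 0
        for step in range(n):            # up to one full period from `start`
            total += cycle[(start + step) % n]
            G.add(total % p)
    return p, G

def allowed_intervals_mod_p(cycle):
    """Return (p, H_p), where H_p is the complement of G_p in Z_p."""
    p, G = initiation_intervals_mod_p(cycle)
    return p, set(range(p)) - G

print(initiation_intervals_mod_p((1, 2, 3)))
print(allowed_intervals_mod_p((1, 2, 3)))
print(allowed_intervals_mod_p((1, 9)))
```

For the cycle (1,2,3) the complement is empty, which is the reason no
row with more than one X can be allowed.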

Let us define a maximal pipeline with respect to a cycle as
an allowable pipeline which has the highest lower bound among allowable
pipelines.

To find the lower bound of a maximal pipeline relative to a
cycle, we can construct a reservation table with as many X's as
possible in some row subject to satisfying the allowability constraint,
$F_p \cap G_p = \emptyset$. The constraint may be rewritten as
$F_p \subseteq H_p$, where $H_p$ is the complement of $G_p$ in $Z_p$ (see
Section 2.2).

Two elements $i$ and $j$, such that $i, j \in Z_p$, are defined to be
compatible if $|j-i| \in H_p$.

Using this definition we can form all compatibility classes
of the cycle. A compatibility class is one in which each element is
compatible with every other element in the class.


In the theorems to follow, we establish some transformations
on a row, which do not change the allowability of the row and give
several ways of constructing an allowable row from a given compatibility
class. Let us denote a row which has an X in each of columns
$t_1, t_2, \ldots, t_k$ as row $\{t_1, t_2, \ldots, t_k\}$. Although we have used
only nonnegative integers for column numbers in all our examples, there
is no need for such a restriction. In the following discussion column
numbers are assumed to be integers without any restriction.

The following lemma is a useful redefinition of compatibility,
which avoids the use of absolute quantities.

Lemma 2.9.1.1: Two integers $i, j \in Z_p$ are compatible if and only if
$(i-j) \bmod p \in H_p$.

Proof: $i$ and $j$ are compatible iff

\[ |i-j| \in H_p \quad \text{(by definition).} \]

Now $|i-j| = |i-j| \bmod p$ since $0 \le i, j < p$. Therefore we need to show
that

\[ |i-j| \bmod p \in H_p \iff (i-j) \bmod p \in H_p. \]

If $i \ge j$ then it is trivially true. Let $j > i$, then $|i-j| = (j-i)$.
Hence we need to show that

\[ (j-i) \bmod p \in H_p \iff (i-j) \bmod p \in H_p \]

or

\[ (j-i) \bmod p \in H_p \iff p - ((j-i) \bmod p) \in H_p. \]

This we know is true from Property 2.2.3. Therefore $i$ and $j$ are
compatible iff $(i-j) \bmod p \in H_p$. Q.E.D.


We can make a similar simplification on the condition of
allowability; that is, we show that it is not necessary to use absolute
quantities.

Lemma 2.9.1.2: Given a pair of X's in columns $t_1$ and $t_2$ of a row, the
usage interval $|t_2 - t_1|$ is allowable if and only if
$(t_2 - t_1) \bmod p \in H_p$.

Proof: By Theorem 2.2.1, the usage interval $|t_2 - t_1|$ is allowable iff
$|t_2 - t_1| \bmod p \in H_p$. Thus we need to show,

\[ |t_2 - t_1| \bmod p \in H_p \iff (t_2 - t_1) \bmod p \in H_p. \]

If $t_2 \ge t_1$ then it is trivially true. Therefore let $t_1 > t_2$. Then we
need to show,

\[ (t_1 - t_2) \bmod p \in H_p \iff (t_2 - t_1) \bmod p \in H_p \]

or

\[ (t_1 - t_2) \bmod p \in H_p \iff p - [(t_1 - t_2) \bmod p] \in H_p \]

which is true by Property 2.2.3. Q.E.D.
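
The equivalence stated in Lemma 2.9.1.2 is convenient in code, because
the signed difference can be reduced modulo $p$ directly. The following
Python sketch compares the signed test with the absolute-value test over
a range of column pairs. The set $H_{10} = \{2, \ldots, 8\}$ is the one
given for the cycle (1,9) in Example 2.9.1; the function names are
hypothetical. Note that Python's `%` operator already returns a value in
$0 \ldots p-1$ for a negative left operand, which matches the mod used in
the text.

```python
def row_allowable(row, H, p):
    """Signed test of Lemma 2.9.1.2 applied to every ordered pair of X's."""
    return all((t2 - t1) % p in H for t1 in row for t2 in row if t1 != t2)

def row_allowable_abs(row, H, p):
    """The same test written with absolute usage intervals."""
    return all(abs(t2 - t1) % p in H for t1 in row for t2 in row if t1 != t2)

H10 = set(range(2, 9))                   # H_p of cycle (1,9), p = 10
print(row_allowable([0, 2, 5, 7], H10, 10))
print(row_allowable([-3, 12, 20], H10, 10))
print(all(row_allowable([a, b], H10, 10) == row_allowable_abs([a, b], H10, 10)
          for a in range(-12, 13) for b in range(-12, 13)))
```

The two tests agree only because $H_p$ is symmetric in the sense of
Property 2.2.3; for an arbitrary set in place of $H_p$ they could differ.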

Lemma 2.9.1.3: Given a set of integers $\{t_1, t_2, \ldots, t_r\}$, either all
or none of the following reservation table rows are allowed by a cycle
with period $p$.

\[ \text{Row}\{[t_1 + k] \bmod p + i_1 p, \ldots, [t_r + k] \bmod p + i_r p\} \]

for all integers $k, i_1, i_2, \ldots, i_r$.

Proof: We shall pick two arbitrary rows from the above set and show
that either both or none of them are allowed by a cycle with period $p$,
that is, we show that

\[
\begin{array}{l}
\text{Row}\{[t_1 + z] \bmod p + j_1 p, \ldots, [t_r + z] \bmod p + j_r p\} \text{ allowable} \\
\iff \text{Row}\{[t_1 + z'] \bmod p + j'_1 p, \ldots, [t_r + z'] \bmod p + j'_r p\} \text{ allowable.}
\end{array}
\]

Let $t_a$ and $t_b$ be any two members of the set $\{t_1, \ldots, t_r\}$. By Lemma
2.9.1.2, we need to show that

\[
\begin{array}{l}
([t_a + z] \bmod p + j_a p - [t_b + z] \bmod p - j_b p) \bmod p \in H_p \\
\iff ([t_a + z'] \bmod p + j'_a p - [t_b + z'] \bmod p - j'_b p) \bmod p \in H_p.
\end{array}
\]

But we have

\[
\begin{array}{l}
([t_a + z] \bmod p + j_a p - [t_b + z] \bmod p - j_b p) \bmod p \\
\quad = ([t_a + z] - [t_b + z]) \bmod p \\
\quad = ([t_a + z'] - [t_b + z']) \bmod p \\
\quad = ([t_a + z'] \bmod p - [t_b + z'] \bmod p) \bmod p \\
\quad = ([t_a + z'] \bmod p + j'_a p - [t_b + z'] \bmod p - j'_b p) \bmod p.
\end{array}
\]

Hence either both rows or neither are allowable. Q.E.D.

Theorem 2.9.1: Let $\{z_1, z_2, \ldots, z_r\}$ be a compatibility class of a
given cycle with period $p$, then all of the following rows are allowable.

\[ \text{Row}\{[z_1 + k] \bmod p + i_1 p, [z_2 + k] \bmod p + i_2 p, \ldots, [z_r + k] \bmod p + i_r p\} \]

for all integers $k, i_1, \ldots, i_r$.

Proof: For any two elements $z_a$ and $z_b$ in the compatibility class we
have by Lemma 2.9.1.1

\[ (z_a - z_b) \bmod p \in H_p. \]

And by Lemma 2.9.1.2 the usage interval $|z_a - z_b|$ is allowable.
Hence the row $\{z_1, z_2, \ldots, z_r\}$ is allowable. By setting
$k = i_1 = \cdots = i_r = 0$ we see that row $\{z_1, z_2, \ldots, z_r\}$ is a member
of the set of rows defined in the theorem statement. By Lemma 2.9.1.3,
row $\{z_1, \ldots, z_r\}$ being allowable implies that all of the rows in the
theorem statement are allowable. Q.E.D.
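
Theorem 2.9.1 can be spot-checked numerically. The sketch below builds
rows of the stated form from one compatibility class of the cycle (1,9)
of Example 2.9.1, using randomly chosen integers $k, i_1, \ldots, i_r$, and
tests each row with the signed criterion of Lemma 2.9.1.2. The function
names, the random ranges and the seed are arbitrary choices made for
illustration; a random check of this kind is not a substitute for the
proof.

```python
import random

def transformed_row(cls, p, k, shifts):
    """Row {[z + k] mod p + i*p}: one integer i from `shifts` per element z."""
    return [((z + k) % p) + i * p for z, i in zip(cls, shifts)]

def is_allowable(row, H, p):
    """Every pairwise signed difference, reduced mod p, must lie in H."""
    return all((a - b) % p in H for a in row for b in row if a != b)

p, H = 10, set(range(2, 9))              # cycle (1,9): H_10 = {2, ..., 8}
cls = [0, 2, 5, 7]                       # a compatibility class of the cycle
rng = random.Random(0)
rows = [transformed_row(cls, p, rng.randint(-50, 50),
                        [rng.randint(-5, 5) for _ in cls])
        for _ in range(1000)]
print(all(is_allowable(r, H, p) for r in rows))
```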

Sometimes we may want to use only very simple transformations
on a compatibility class to obtain an allowable row. The following
corollary gives three such transformations.

Corollary 2.9.1.1: Let $\{z_1, z_2, \ldots, z_r\}$ be a compatibility class of a
cycle with period $p$, then the following rows are allowable:

(a) Row$\{z_1 + k, z_2 + k, \ldots, z_r + k\}$ for all integers $k$

(b) Row$\{(z_1 + k) \bmod p, \ldots, (z_r + k) \bmod p\}$ for all integers $k$

(c) Row$\{z_1 + i_1 p, \ldots, z_r + i_r p\}$ for all integers $i_1, \ldots, i_r$.

Proof: The sets of rows in (a), (b) and (c) can be shown to be subsets
of the sets of rows defined in Theorem 2.9.1. Hence the rows in (a),
(b) and (c) are allowable. Q.E.D.

We might point out here that the union of sets in (a), (b)
and (c) of the above corollary does not necessarily equal the set
defined in Theorem 2.9.1, but if (b) is applied to a given row,
$\{z_1, z_2, \ldots, z_r\}$, and (c) is applied to the resulting set of rows, the
resulting set is equal to the set defined in Theorem 2.9.1.

We might ask the question whether the procedure given in
Theorem 2.9.1 is sufficient to construct all allowable rows, in the
sense that if the procedure is applied to all the compatibility classes
of a cycle, all possible rows allowed by the given cycle will be
generated. The following lemma shows that even the simple operation
(c) in Corollary 2.9.1.1 is sufficient.

Lemma 2.9.2.1: Given a cycle with period $p$ and the set $C$ consisting
of all its compatibility classes, the following rows are the only rows
which are allowed by the cycle:

\[ \text{Row}\{z_1 + i_1 p, z_2 + i_2 p, \ldots\} \]

for all integers $i_1, i_2, \ldots$ and for all $\{z_1, z_2, \ldots\} \in C$.

Proof: From (c) of Corollary 2.9.1.1, the above rows are allowable.
Next we show that every allowable row is included in the above set of
rows. Let row $\{t_1, t_2, \ldots, t_k\}$ be an allowable row. Then by Lemma
2.9.1.2, for any two elements $t_a$ and $t_b$ of the row

\[ (t_a - t_b) \bmod p \in H_p. \]

Let

\[ t_i \text{ be expressed as } (t_i \bmod p) + j_i p \quad \text{for } 1 \le i \le k. \tag{1} \]

Then

\[
\begin{array}{l}
(t_a - t_b) \bmod p \in H_p \\
\Rightarrow ([t_a \bmod p] + j_a p - [t_b \bmod p] - j_b p) \bmod p \in H_p \\
\Rightarrow ([t_a \bmod p] - [t_b \bmod p]) \bmod p \in H_p.
\end{array}
\]

We also have that

\[ (t_a \bmod p) \in Z_p \quad \text{and} \quad (t_b \bmod p) \in Z_p. \]

Therefore by Lemma 2.9.1.1 $(t_a \bmod p)$ and $(t_b \bmod p)$ are compatible
and hence $\{(t_1 \bmod p), \ldots, (t_k \bmod p)\}$ is a compatibility class.
Then by (c) of Corollary 2.9.1.1

\[ \text{row}\{(t_1 \bmod p) + j_1 p, \ldots, (t_k \bmod p) + j_k p\} \]

is allowable and included in the set of rows defined in the statement
of this lemma. By (1) this row is identical to the row $\{t_1, \ldots, t_k\}$
and hence row $\{t_1, \ldots, t_k\}$ is included in the set defined by this
lemma. Q.E.D.

Theorem 2.9.2: The lower bound of a maximal pipeline with respect to
a given cycle is equal to the size of the largest compatibility class
of the cycle.

Proof: By Lemma 2.9.2.1, the only rows which are allowed by a cycle
with period $p$ are:

\[ \text{Row}\{z_1 + i_1 p, z_2 + i_2 p, \ldots\} \]

for all integers $i_1, i_2, \ldots$ and for all compatibility classes
$\{z_1, z_2, \ldots\}$ of the cycle. Thus the maximum number of X's in any
allowable row is equal to the size of the largest compatibility class.
Thus the lower bound of a maximal pipeline is equal to the size of the
largest compatibility class. Q.E.D.

Now to gain information regarding the lower bound of a maximal
pipeline or the set of allowable rows with respect to a cycle we need to
form only the maximal compatibility classes of the cycle. A maximal
compatibility class is a compatibility class which is not a subset of
any other compatibility class.

We can further limit ourselves to producing only those
maximal compatibility classes which contain element 0, because the
other classes can always be produced by adding a constant to the
members of some compatibility class containing 0; furthermore these
classes (classes not containing 0) do not yield any new information
on allowability of rows. This fact is formally stated in the following
theorem.

Theorem 2.9.3: Let $\{z_1, z_2, \ldots, z_i, \ldots, z_r\}$ be a compatibility class
of some given cycle, not containing 0. Then there exists a compatibility
class containing 0, such that the use of Theorem 2.9.1 on both these
classes results in two identical sets of allowable rows. This class is
$\{z_1 - z_i, z_2 - z_i, \ldots, z_i - z_i, \ldots, z_r - z_i\}$, where $z_i$ is the
smallest element of the given class.

Proof: For any two elements $z_a$ and $z_b$ of the given class

\[ |z_a - z_b| \in H_p \quad \text{(by definition of compatibility).} \]

Also given that

\[ z_a, z_b, z_i \in Z_p \quad \text{and} \quad z_a \ge z_b \ge z_i, \]

it follows that

\[ (z_a - z_i) \in Z_p \quad \text{and} \quad (z_b - z_i) \in Z_p. \]

Thus $(z_a - z_i)$ and $(z_b - z_i)$ are compatible and hence
$\{z_1 - z_i, z_2 - z_i, \ldots, z_i - z_i, \ldots, z_r - z_i\}$ is a compatibility class
containing 0. The sets of allowable rows defined by Theorem 2.9.1 on
$\{z_1, \ldots, z_r\}$ and $\{z_1 - z_i, \ldots, z_r - z_i\}$ are clearly identical
because $k$ of Theorem 2.9.1 takes all integer values. Q.E.D.

Here are some examples.

Example 2.9.1: Cycle (1,9), period $p = 10$.

\[ G_{10} = \{0, 1, 9\}, \qquad H_{10} = \{2, 3, 4, 5, 6, 7, 8\}. \]

Maximal compatibility classes containing 0 are:

{0,2,4,6,8}, {0,2,4,7}, {0,2,5,7}, {0,2,5,8},
{0,3,5,7}, {0,3,5,8}, {0,3,6,8}.

The cardinality of the largest class is 5, therefore by
Theorem 2.9.2 the lower bound of the maximal pipeline with respect to
cycle (1,9) is 5. In other words no allowable row can have more than
5 X's. □

For the hand calculation of maximal compatibility classes
containing 0, we suggest the following systematic procedure. For a
better understanding refer to Figure 2.9.1 when reading the following
procedure. Figure 2.9.1 corresponds to the generation of maximal
compatibility classes of the cycle (1,9) of the above example.

Algorithm 2.9.1:

C1. Form the set $H_p$ of the cycle by the usual procedures. Write 0 at
the root of the tree, which is to be generated.

C2. Compute the immediate successors of 0 and place them in increasing
order from left to right. The immediate successors of 0 are all the
integers which are compatible with 0 (these are simply the elements
of $H_p$).

C3. Pick the leftmost leaf-node in the tree which is not circled or
crossed out; call this node N. If no such nodes exist, then
terminate this procedure. Each set of nodes on the path from the
root of the tree to a circled node corresponds to a maximal
compatibility class.

C4. Generate the immediate successors of node N and place them in
increasing order from left to right. The immediate successors of
N are defined to be those brothers of N which appear to the right
of N and whose values are compatible with the value of N. Nodes
which have the same immediate predecessor are brothers. If no
such successor exists then check if this compatibility class
(the set of nodes on the path from the root of the tree to node N)
is a subset of a previously generated compatibility class (the set
of nodes on the path from the root of the tree to a circled node).
If not a subset then circle the node N else cross it out. Go to
step C3. □
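
The tree procedure of Algorithm 2.9.1 can be sketched as a depth-first
recursion in Python. Visiting children from left to right plays the role
of "pick the leftmost leaf-node" in step C3, and the list `circled`
stands for the circled nodes. The function name and the data
representation are assumptions made for illustration.

```python
def maximal_classes_containing_zero(H, p):
    """Maximal compatibility classes containing 0, in tree order (C1-C4)."""
    def compatible(a, b):
        return (a - b) % p in H          # signed form of Lemma 2.9.1.1

    circled = []                         # classes ending at circled nodes

    def expand(path, brothers):
        # `brothers` holds the brothers to the right of the last node on `path`
        node = path[-1]
        successors = [b for b in brothers if compatible(b, node)]   # C4
        if not successors:               # leaf: circle it or cross it out
            cls = set(path)
            if not any(cls <= old for old in circled):
                circled.append(cls)
            return
        for idx, child in enumerate(successors):    # C3: left to right
            expand(path + [child], successors[idx + 1:])

    expand([0], sorted(H))               # C1, C2: successors of 0 are H_p
    return [sorted(c) for c in circled]

classes = maximal_classes_containing_zero(set(range(2, 9)), 10)   # cycle (1,9)
for c in classes:
    print(c)
print("largest class size:", max(len(c) for c in classes))
```

For the cycle (1,9) this is expected to reproduce the seven classes
listed in Example 2.9.1.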

The  above  algorithm  works  in  the  following  manner.  For  each 
node  of  the  tree  we  generate  all  its  successors  which  are  compatible 
with  it  and  all  its  predecessors.  Thus  a node  and  its  brothers  are  the 
only  elements  which  are  compatible  with  the  node's  predecessors. 
Therefore  to  generate  a node's  successors,  one  only  needs  to  look  at 
the  brothers  of  that  node.  When  we  no  longer  have  a successor  to  a 
node  we  have  generated  a maximal  compatibility  class  or  a subset  of  an 
already  generated  maximal  compatibility  class.  The  above  algorithm 
can  be  speeded  up  by  taking  into  account  the  special  nature  of 
compatibility  classes.  This  is  discussed  below. 

By Corollary 2.9.1.1b, if $\{z_1, \ldots, z_r\}$ is an allowable row of
a cycle then for any integer k,
$\{(z_1+k) \bmod p, \ldots, (z_r+k) \bmod p\}$ is also
an allowable row. The effect of adding k modulo p is to rotate the
X's of the row. That is, if one considers a row with p cells and an
X in each of the cells $z_1, z_2, \ldots, z_r$, then
$\{(z_1+k) \bmod p, \ldots, (z_r+k) \bmod p\}$ is a row
obtained by an end-around shift of the X's by k cells. The same rotation
operation is also applicable to compatibility classes since a
compatibility class is also an allowable row and an allowable row is
also a compatibility class if the elements of the row are between 0
and p-1. Thus, if $\{0, z_1, \ldots, z_i, \ldots\}$ is a compatibility class
containing 0 then
$\{(0+z'_i) \bmod p, (z_1+z'_i) \bmod p, \ldots, (z_i+z'_i) \bmod p, \ldots\}$ is
also a compatibility class containing 0, where $z'_i = p - z_i$. Therefore,
if given all the maximal compatibility classes containing say 0 and z,
one can generate all maximal classes containing 0 and p-z by a rotation
operation. Moreover, one can also generate by rotation all classes
which contain two elements with difference z; i.e., all classes of the
type $\{0, \ldots, i, i+z, \ldots\}$. These two properties allow us the following
algorithm. In this algorithm as soon as we form a maximal compatibility
class we also form all its rotations. Once all the classes containing,
say i, and their rotations are formed, we eliminate (p-i) from the
tree and restrict ourselves to forming compatible pairs with a
difference strictly greater than i. The tree formed by the following
algorithm for cycle (1,9) is given in Figure 2.9.2. Compare it
with Figure 2.9.1.
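The rotation property can be illustrated with a short sketch. The function name `rotations_containing_zero` is hypothetical and does not come from the report; the code simply applies the shift $z' = p - z$ for every member z of a class, assuming the class is given as a set of integers in the range 0 to p-1.

```python
def rotations_containing_zero(cls, p):
    """Return every rotation of `cls` (mod p) that contains 0.

    Shifting by z' = p - z for each member z maps z onto 0, which is the
    rotation described in the text. (Hypothetical helper, not from the report.)
    """
    rotated = set()
    for z in cls:
        shift = (p - z) % p
        rotated.add(tuple(sorted((x + shift) % p for x in cls)))
    return sorted(rotated)


# Cycle (1,9) has period p = 10; {0,2,4,7} is one of its compatibility classes.
for row in rotations_containing_zero({0, 2, 4, 7}, 10):
    print(row)
```

For the class {0,2,4,7} this yields the class itself together with {0,2,5,8}, {0,3,5,7} and {0,3,6,8}, the rotations listed for it in Figure 2.9.2.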

Algorithm 2.9.2:

C1. Form the set $H_p$ of the cycle by the usual procedures. Write 0
at the root of the tree, which is to be generated.

C2. Compute the immediate successors of 0 and place them in increasing
order from left to right. The immediate successors of 0 are
all the integers which are compatible with 0 (these are simply the
elements of $H_p$).


[Figure 2.9.2. Generating maximal compatibility classes of
cycle (1,9) using Algorithm 2.9.2. The tree is not reproduced here.
Classes formed in the tree: {0,2,4,6,8}, {0,2,4,7}, {0,2,5,7}.
Their rotations: {0,3,5,7}, {0,3,6,8}, {0,2,5,8} and {0,3,5,8}.]




C3. Pick the leftmost leaf-node in the tree; call this node N. Set
i ← value of N.

C4. Generate the immediate successors of N and place them in
increasing order from left to right. The immediate successors of N
are brothers of N (which are not crossed out by C5) whose values
[condition partly illegible in the source]. (Nodes which have the
same immediate predecessor are brothers.) If no such node exists then
check [text illegible in the source] compatibility class [text
illegible] previously generated compatibility class [text illegible]
and generate other maximal classes by rotation.

C5. Pick the leftmost leaf-node which is not yet crossed out; call it
N. If no such node exists, terminate this procedure. If N is an
immediate successor of 0 then cross out the rightmost leaf-node which
is not yet crossed out, and set i ← value(N). Go to C4.

[Text illegible in the source] formed directly in the tree (i.e., not
by rotation) [text illegible] theorem already incorporates the
rotation [text illegible] characterize all the allowable rows of
cycle (1,9). [Text illegible] 2.9.2 the following classes are [text
illegible]:

{0,2,4,6,8}, {0,2,4,7}, {0,2,5,7}.




One  more  example  follows. 

Example 2.9.2: For cycle (2,3,7), period p = 12,

$G_{12}$ = {0,2,3,5,7,9,10}, $H_{12}$ = {1,4,6,8,11}.

All maximal compatibility classes containing 0 are:

{0,1}, {0,4,8}, {0,6}, {0,11}.

Of these the first three classes are sufficient to characterize all the
allowable rows by use of Theorem 2.9.1. The size of the largest class
is 3, hence the lower bound of the maximal pipeline is 3 (by Theorem
2.9.2). This means that no pipeline with lower bound 4 is allowed by
the cycle (2,3,7), even though the average latency of the cycle is 4.
This also means that if cycle (2,3,7) is used as an initiation cycle
on some allowable pipeline then no segment is used more than 3/4 = 75%
of the time. Thus the cycle (2,3,7) is imperfect in some sense. □
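The sets in this example can be checked mechanically. The following is a minimal brute-force sketch, not the tree procedure of Algorithm 2.9.1 or 2.9.2: it enumerates subsets of $H_p$ and keeps the pairwise compatible ones, so it is only practical for small periods. The function names are hypothetical, and a cycle is assumed to be given as a tuple of latencies.

```python
from itertools import combinations


def interval_sets(cycle):
    """Return (p, G_p, H_p) for a cycle given as a tuple of latencies."""
    n, p = len(cycle), sum(cycle)
    g = {0}  # the period itself is an initiation interval, and p mod p = 0
    for start in range(n):
        total = 0
        for step in range(n - 1):
            total += cycle[(start + step) % n]
            g.add(total % p)
    h = set(range(p)) - g
    return p, g, h


def maximal_classes_containing_zero(cycle):
    """Brute-force the maximal compatibility classes that contain 0."""
    p, g, h = interval_sets(cycle)
    candidates = sorted(h)  # exactly the integers compatible with 0
    classes = []
    for size in range(len(candidates), -1, -1):  # largest subsets first
        for subset in combinations(candidates, size):
            # two elements of Z_p are compatible iff their difference is in H_p
            if all(abs(a - b) in h for a, b in combinations(subset, 2)):
                cls = {0, *subset}
                if not any(cls <= found for found in classes):
                    classes.append(cls)
    return sorted(sorted(c) for c in classes)


print(interval_sets((2, 3, 7)))
print(maximal_classes_containing_zero((2, 3, 7)))
```

The size of the largest class returned is the lower bound of the maximal pipeline, which can then be compared with the average latency p/n of the cycle.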

Let us define a cycle to be perfect if the lower bound of
its maximal pipeline equals the average latency of the cycle. Thus
the cycle (1,9) of Example 2.9.1 is perfect. There is no straightforward
method to test the perfectness of a cycle. However, the following
theorem gives a method to generate a large number of perfect cycles.

Lemma 2.9.4.1: If none of the nonzero elements of $G_p$ of a cycle
$(i_0, i_1, \ldots, i_{n-1})$ is divisible by n, then n|p and the cycle is a perfect
cycle.




Proof: Let $\ell = \lceil p/n \rceil$. We show that $\{0, n, 2n, \ldots, (\ell-1)n\}$ is a
compatibility class of the cycle. First we must show that the elements of the
class are in $Z_p$. All elements are nonnegative, therefore it is
sufficient to show that $(\ell-1)n < p$.

$\ell = \lceil p/n \rceil$, therefore $\ell < p/n + 1$, hence $(\ell-1)n < p$.

Now suppose $\{0, n, \ldots, (\ell-1)n\}$ is not a compatibility class.
Then there exist two elements $m_1 n$ and $m_2 n$ in the above class such that
$0 \le m_1 < m_2 \le (\ell-1)$ and $m_2 n - m_1 n \notin H_p$. Therefore $(m_2 - m_1)n \in G_p$.
Now $m_1 \ne m_2$. Therefore $(m_2 - m_1)n$ is a nonzero element of $G_p$ which is
divisible by n, a contradiction. Therefore $\{0, n, 2n, \ldots, (\ell-1)n\}$ is a
compatibility class with $\ell$ elements. This implies that the given cycle
with average latency p/n is allowed by some pipeline with lower bound
$\ell$. Since no segment can be used more than 100% of the time we must
have $\ell \le p/n$. However, $\ell = \lceil p/n \rceil$ implies $\ell \ge p/n$. Therefore $\ell = p/n$. Thus
n|p and the cycle is perfect. Q.E.D.

Theorem 2.9.4: If none of the nonzero elements of $G_p$ of a cycle
$(i_0, i_1, \ldots, i_{n-1})$ is divisible by n, then the following cycles are
perfect:

cycle $((i_0 + j_0 n)k, (i_1 + j_1 n)k, \ldots, (i_{n-1} + j_{n-1} n)k)$

∀ integers $j_0, j_1, \ldots, j_{n-1} \ge 0$ and $k \ge 1$.

Proof: Let $G'_{p'}$ be the set of initiation intervals of cycle
$((i_0 + j_0 n)k, \ldots, (i_{n-1} + j_{n-1} n)k)$ and p' be its period. Clearly n|p',
since by Lemma 2.9.4.1 $n \mid (i_0 + i_1 + \cdots + i_{n-1})$. Let $p'/n = \ell k$. Then
we show that

$\{0, 1, 2, \ldots, (k-1), nk, nk+1, \ldots, nk+(k-1), \ldots, (\ell-1)nk,
(\ell-1)nk+1, \ldots, (\ell-1)nk+(k-1)\}$

is a compatibility class with $\ell k$ elements and hence the cycle is
perfect.

Suppose the above is not a compatibility class, then there
exist two elements $m_1 nk + m_2$ and $m_3 nk + m_4$ in the above class such that
$m_1 nk + m_2 < m_3 nk + m_4$ and $0 \le m_1, m_3 \le (\ell-1)$, $0 \le m_2, m_4 \le (k-1)$ and

$(m_3 nk + m_4) - (m_1 nk + m_2) \in G'_{p'} \bmod p'$.

All the nonzero elements of $G'_{p'}$ are of the form $(g + mn)k$, where
$g \ne 0$ and $g \in G_p$ and integer $m \ge 0$. Therefore,
$(m_3 nk + m_4) - (m_1 nk + m_2) = (g + mn)k$. Dividing by k, we get

$m_3 n - m_1 n + (m_4 - m_2)/k = g + mn$.

Since $0 \le m_2, m_4 \le k-1$, if $(m_4 - m_2)/k$ is to be an integer, then $(m_4 - m_2)$
must be zero. Thus,

$m_3 n - m_1 n = g + mn$

or

$m_3 - m_1 = g/n + m$.

This is a contradiction, since $g \ne 0$ and n does not divide g. Therefore
the assumed class is a compatibility class of the cycle
$((i_0 + j_0 n)k, \ldots, (i_{n-1} + j_{n-1} n)k)$ and hence the cycle is perfect. Q.E.D.

A large  number  of  perfect  cycles  can  be  generated  using  the 
above  theorem,  provided  one  already  has  a cycle  which  satisfies  the 
nondivisibility  condition  of  the  above  theorem.  The  following 




corollary uses a very simple condition to generate a set of perfect
cycles.

Corollary 2.9.4.1: If i and n are relatively prime (i.e.,
gcd(i,n) = 1) then the following cycles are perfect:

cycles $((i + j_0 n)k, (i + j_1 n)k, \ldots, (i + j_{n-1} n)k)$

∀ integers $j_0, j_1, \ldots, j_{n-1} \ge 0$ and $k \ge 1$.

Proof: $G_p$ of cycle (i,i,...,i) is $\{0, i, 2i, \ldots, (n-1)i\}$ and hence no
nonzero element of $G_p$ is divisible by n. Then by Theorem 2.9.4 the
above cycles are perfect. Q.E.D.

The  following  is  a direct  consequence  of  the  above  corollary 
with  n set  to  1. 

Corollary 2.9.4.2: All constant latency cycles are perfect. □

Example 2.9.3: gcd(2,3) = 1, therefore by Corollary 2.9.4.1
cycle (2, 2+3, 2+6) = (2,5,8) is a perfect cycle. From the proof of
Theorem 2.9.4, one of the largest compatibility classes is {0,3,6,9,12}.

Cycle (2×2, 5×2, 8×2) = (4,10,16) is also perfect. One of its
compatibility classes having (4+10+16)/3 = 10 elements is, from
Theorem 2.9.4, {0,1,6,7,12,13,18,19,24,25}. □
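The construction in Theorem 2.9.4 can be sketched in code for this example. All function names below are hypothetical. The sketch assumes a cycle is a tuple of latencies, checks the nondivisibility condition on the base cycle (2,2,2), derives the new cycles, and verifies that the class built in the proof is pairwise compatible.

```python
from itertools import combinations


def g_mod_p(cycle):
    """Return (p, G_p) for a cycle given as a tuple of latencies."""
    n, p = len(cycle), sum(cycle)
    g = {0}
    for start in range(n):
        total = 0
        for step in range(n - 1):
            total += cycle[(start + step) % n]
            g.add(total % p)
    return p, g


def satisfies_nondivisibility(cycle):
    """True if no nonzero element of G_p is divisible by n."""
    n = len(cycle)
    _, g = g_mod_p(cycle)
    return all(x % n != 0 for x in g if x != 0)


def derived_cycle(base, js, k):
    """Cycle ((i_0 + j_0 n)k, ..., (i_{n-1} + j_{n-1} n)k) of Theorem 2.9.4."""
    n = len(base)
    return tuple((i + j * n) * k for i, j in zip(base, js))


def theorem_class(n, l, k):
    """The class {m*n*k + r : 0 <= m < l, 0 <= r < k} used in the proof."""
    return [m * n * k + r for m in range(l) for r in range(k)]


def is_compatibility_class(cls, cycle):
    p, g = g_mod_p(cycle)
    h = set(range(p)) - g
    in_range = all(0 <= x < p for x in cls)
    return in_range and all(abs(a - b) in h for a, b in combinations(cls, 2))


base = (2, 2, 2)                        # i = 2, n = 3, gcd(2, 3) = 1
cycle = derived_cycle(base, (0, 1, 2), 2)
l = sum(cycle) // (len(cycle) * 2)      # p'/n = l*k with k = 2
print(cycle, theorem_class(3, l, 2))
```

Because the class has p'/n elements, which equals the average latency, the derived cycle attains the bound that defines a perfect cycle.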

Example 2.9.4: For cycle (2,1,2,3), $G_8$ = {0,1,2,3,5,6,7}. The number
of initiations in the cycle is n=4. None of the positive elements of
$G_8$ is divisible by 4. Therefore cycle (2,1,2,3) is perfect and so are:




Next  we  turn  our  attention  to  making  a pipeline  allowable  by 
use  of  noncompute  segments.  The  following  theorem  is  a generalization 
of  Theorem  2.3.1. 

Theorem 2.9.5: For a given cycle with period p, a pipeline can be made
allowable using noncompute segments if and only if the lower bound b
of the maximal pipeline of the cycle is greater than or equal to the
lower bound m of the pipeline.

Proof: The 'only if' part is trivially true from the definition of
the maximal pipeline, namely, that no allowable pipeline has lower
bound greater than b.

If $b \ge m$ then the pipeline can be made allowable as shown
below, by making every row of the reservation table of the form
described in Lemma 2.9.2.1, that is, every row is of the form

row $\{z_1 + i_1 p, z_2 + i_2 p, \ldots\}$

for some compatibility class $\{z_1, z_2, \ldots\}$ of the given cycle and for
some integers $i_1, i_2, \ldots$

To achieve this form, scan the columns of the given reservation
table from left to right, and scan each column from top to bottom for
X's. An X may be moved to the right to ensure that the column in which
this X appears has a number not equivalent modulo p to that of any
previously scanned X in the same row; and to ensure that all these
column numbers modulo p (in the same row) are contained in some single
compatibility class of the cycle. This can always be done because the
largest compatibility class has b elements (by definition of maximal
pipeline) and there are no more than m X's in any row, where
$m \le b$.

As  described  in  the  proof  of  Theorem  2.3.1,  moving  an  X to 
the  right  in  the  reservation  table  is  equivalent  to  inserting 
appropriate  noncompute  delays  in  the  pipeline.  Q.E.D. 

Example 2.9.5: We shall illustrate the construction of the above
proof on the reservation table of Figure 2.9.3a. Its lower bound is
m=2. Suppose we want to make this pipeline allowable for cycle
(2,3,4). First we verify that the lower bound of the maximal pipeline
relative to the cycle (2,3,4) is greater than or equal to 2. The cycle
has period p = 9,

$G_9$ = {0,2,3,4,5,6,7} and $H_9$ = {1,8}.

Maximal compatibility classes containing element 0 are {0,1} and
{0,8}. Therefore the lower bound of the maximal pipeline is 2.

Figure 2.9.3b is a reconfigured table which is allowable. Verify that
the forbidden set F is {8,10} and therefore

$F_9 = \{1,8\} \subseteq H_9$ as required. □
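The final check of this example, that the usage intervals reduced modulo p avoid $G_p$, is easy to sketch. The helper names are hypothetical, and the set {8,10} is taken from the text rather than derived from the figure.

```python
def initiation_intervals_mod_p(cycle):
    """Return (p, G_p) for a cycle given as a tuple of latencies."""
    n, p = len(cycle), sum(cycle)
    g = {0}
    for start in range(n):
        total = 0
        for step in range(n - 1):
            total += cycle[(start + step) % n]
            g.add(total % p)
    return p, g


def is_allowed(usage_intervals, cycle):
    """True if (F mod p) and G_p are disjoint, i.e. F mod p lies inside H_p."""
    p, g = initiation_intervals_mod_p(cycle)
    return all(f % p not in g for f in usage_intervals)


print(is_allowed({8, 10}, (2, 3, 4)))
```

Any usage interval congruent modulo 9 to an element of $G_9$, such as 2 or 11, would make the same test fail.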

Next we consider the sharing of noncompute segments by the
elemental delays. We shall assume that during one time unit a
noncompute segment can provide exactly one elemental delay. The
definition for the incompatibility between two elemental delays is
similar to that for constant latency cycles in Section 2.4; that is,
the elemental delays $d_i$ and $d_j$ are incompatible iff
$|t_i - t_j| \bmod p \in G_p$, where $t_i$ and $t_j$ are the labels of the
columns in which $d_i$ and $d_j$ appear
in the reconfigured reservation table. Forming the incompatibility
classes is not very useful because these classes are not disjoint,
since the incompatibility relation is not transitive as it was for
constant latency cycles. Conversely we define $d_i$ and $d_j$ to be
compatible if $|t_i - t_j| \bmod p \in H_p$. We form all the maximal compatibility
classes of the elemental delays. The elements of a compatibility class
can share a single noncompute segment. Now the problem reduces to the
standard covering problem, that is, finding the minimum number of
compatibility classes which cover all the elemental delays.

[Figure 2.9.3. Making a pipeline allowable with respect to
cycle (2,3,4). (a) Starting reservation table. (b) Pipeline (a) made
allowable. (c) Assignment of elemental delays to noncompute segments.
The tables are not reproduced here.]

Consider the previous example. Figure 2.9.3b shows the
elemental delays $d_1$ through $d_7$. For the given cycle (2,3,4),
$H_9$ is {1,8}. Then the maximal compatibility classes of elemental
delays are:

$\{d_1,d_2\}, \{d_2,d_3\}, \{d_3,d_4\}, \{d_4,d_5\}, \{d_5,d_6\}, \{d_1,d_7\}$.

One of many possible minimal covers is
$\{d_1,d_7\}, \{d_2,d_3\}, \{d_4,d_5\}, \{d_5,d_6\}$.
Thus 4 noncompute segments are required to provide all the elemental
delays. When the maximal compatibility classes of a minimal cover are
assigned to noncompute segments, elemental delays appearing in more
than one maximal compatibility class are deleted from all compatibility
classes but one, to obtain a disjoint set of compatibility classes.
This set of compatibility classes now forms a minimal disjoint cover.
One such possible assignment is shown in Figure 2.9.3c, where $S_3$, $S_4$, ...
are noncompute segments.
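The covering step can be sketched as an exhaustive search, which is adequate for a handful of classes. The names `minimal_cover` and `make_disjoint` are hypothetical, and the class list is the one given in the example. The search may return a different minimal cover from the one quoted in the text, since several exist.

```python
from itertools import combinations


def minimal_cover(universe, classes):
    """Smallest collection of classes whose union covers the universe."""
    for size in range(1, len(classes) + 1):
        for combo in combinations(classes, size):
            if set().union(*combo) >= universe:
                return list(combo)
    return None  # no cover exists


def make_disjoint(cover):
    """Keep each delay only in the first class of the cover that holds it."""
    seen, result = set(), []
    for cls in cover:
        kept = [d for d in cls if d not in seen]
        seen.update(kept)
        result.append(kept)
    return result


delays = {"d1", "d2", "d3", "d4", "d5", "d6", "d7"}
classes = [("d1", "d2"), ("d2", "d3"), ("d3", "d4"),
           ("d4", "d5"), ("d5", "d6"), ("d1", "d7")]
cover = minimal_cover(delays, classes)
print(cover)
print(make_disjoint(cover))
```

Each list produced by `make_disjoint` would correspond to the elemental delays assigned to one noncompute segment.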

Of  course  the  same  techniques  could  have  been  used  in  the  case 
of  constant  latency  cycles.  However,  the  techniques  described  in 




Section  2.4  are  simpler  to  use  for  constant  latency  cycles.  Take 
the example of Figure 2.4.1 discussed in Section 2.4. There were 4
incompatibility classes of the elemental delays; if we had instead
generated compatibility classes, there would have been 36 of them,
and finding a minimal cover would have required considerable effort.

Our next concern is making a pipeline allowable for a nonconstant
latency cycle with minimum added execution delay. The problem
formulation is identical to the constant latency problem described in
Section 2.6. In principle the constraints are still the same, namely
$F_p \cap G_p = \emptyset$. The only difference is that $G_p$ is no longer always
equal to {0} and therefore the constraints 2.6.2 are changed to read

$$\Bigl(d'_{ab} + \sum_{b<j<c} \max_{0 \le i < I} (d_{ij}) + d_{ac} + (c-b)\Bigr) \bmod p \notin G_p \qquad (2.9.1)$$

for each pair $(X_{ab}, X_{ac})$ with c > b. In the above, "$\in H_p$" may be
substituted for "$\notin G_p$". The objective function, D, is unchanged. The
algorithm to obtain an optimum solution is also identical to the one
presented in Section 2.7. Before executing the algorithm it is not
necessary to determine whether the pipeline can be made allowable by
comparing the lower bound of the given pipeline to that of the maximal
pipeline relative to the cycle. The algorithm terminates without a
solution if no solution exists.

Continuing with Example 2.9.5, an optimum solution to make
the pipeline of Figure 2.9.3a allowable to cycle (2,3,4) is given in
Figure 2.9.4a. An optimum assignment of elemental delays to noncompute
segments for that solution is given in Figure 2.9.4b.

[Figure 2.9.4. Making a pipeline allowable with respect to
cycle (2,3,4). (a) Minimum delay solution to pipeline of Fig. 2.9.3a.
(b) Optimum assignment of delays. The tables are not reproduced here.]


2.10.  Multifunction  Pipelines 

Most  of  the  results  concerning  single  function  pipelines  can 
be  extended  to  multifunction  pipelines.  In  this  section  we  shall 
present  most  of  these  results  without  proof.  Even  though  the  notational 
complexity  increases  considerably,  the  concepts  still  remain 
straightforward.

Let $F_{XY}$ be the set of usage intervals of all (X,Y) pairs
in the reservation table. This set can be formed from the reservation
table by taking all pairwise distances between an X and a Y which
appears to the right of the X in the same row. If both X and Y appear
in the same cell of the reservation table then the pairwise distance
is considered to be zero. An example is shown in Figure 2.10.1.

Similarly we define $G_{XY}$, the set of initiation intervals of
all (X,Y) pairs of a cycle, to be the set which contains all intervals
of an operand of type X from a previously initiated operand of type Y.

Note that X and Y are used as variables in the definitions and
in the following discussion. They take on values from the set of
function types; the values need not be distinct. The following are
generalizations of Properties 2.2.1, 2.2.2 and 2.2.3.




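Property 2.10.1: a. $0 \in G_{XX} \bmod p$ and $ip \in G_{XX}$ ∀ $i \ge 1$ always.

b. If $g \ne 0$ then $g \in G_{XX} \bmod p \Rightarrow g + ip \in G_{XX}$ ∀ $i \ge 0$.

c. If $X \ne Y$ then $g \in G_{XY} \bmod p \Rightarrow g + ip \in G_{XY}$ ∀ $i \ge 0$. □

Property 2.10.2: a. $0 \in G_{XY} \bmod p \Leftrightarrow 0 \in G_{YX} \bmod p$.

b. If $g \ne 0$ then $g \in G_{XY} \bmod p \Leftrightarrow (p-g) \in G_{YX} \bmod p$. □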

Example 2.10.1: Cycle $(4_A, 2_B, 5_A, 4_C, 0_B)$, period p = 15.

$G_{AA} \bmod 15$ = {0,7,8}
$G_{AB} \bmod 15$ = {4,5,11,13}
$G_{AC} \bmod 15$ = {4,11}
$G_{BA} \bmod 15$ = {2,4,10,11}
$G_{BB} \bmod 15$ = {0,6,9}
$G_{BC} \bmod 15$ = {0,6}
$G_{CA} \bmod 15$ = {4,11}
$G_{CB} \bmod 15$ = {0,9}
$G_{CC} \bmod 15$ = {0} □
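The sets of this example can be recomputed with a short sketch. It rests on two assumptions that are not stated explicitly in the surviving text: each latency is measured from the preceding initiation to the initiation of the task it labels, and the two garbled latencies of the example cycle read 4, which is the only reading consistent with the period 15 and the listed sets. The function name is hypothetical.

```python
def multifunction_g_sets(cycle):
    """Return (p, G) where G[(X, Y)] is G_XY mod p.

    `cycle` is a list of (latency, function) pairs; each latency is
    assumed to be the time elapsed since the previous initiation.
    """
    p = sum(latency for latency, _ in cycle)
    times, t = [], 0
    for latency, fn in cycle:
        t += latency
        times.append((t, fn))
    g = {}
    for tx, x in times:          # operand of type X ...
        for ty, y in times:      # ... measured from an operand of type Y
            g.setdefault((x, y), set()).add((tx - ty) % p)
    return p, g


p, g = multifunction_g_sets([(4, "A"), (2, "B"), (5, "A"), (4, "C"), (0, "B")])
for pair in sorted(g):
    print(pair, sorted(g[pair]))
```

Pairing a task with itself contributes 0 to $G_{XX} \bmod p$, which corresponds to the interval p between successive periods.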


Let $H_{XY}$ be the complement of the set $G_{XY}$ in the set of
nonnegative integers. Then

$H_{XY} \bmod p = Z_p - (G_{XY} \bmod p)$.

Property 2.10.3: a. $0 \in H_{XY} \bmod p \Leftrightarrow 0 \in H_{YX} \bmod p$.

b. If $h \ne 0$ then $h \in H_{XY} \bmod p \Leftrightarrow (p-h) \in H_{YX} \bmod p$. □

Having defined $F_{XY}$, $G_{XY}$ and $H_{XY}$, we state the following theorem
which is a generalization of Theorem 2.2.1.

Theorem 2.10.1: A cycle with period p is allowed by a multifunction
pipeline if and only if

$(F_{XY} \bmod p) \cap (G_{XY} \bmod p) = \emptyset$

or equivalently if and only if

$(F_{XY} \bmod p) \subseteq H_{XY} \bmod p$

∀ X,Y ∈ the set of function types present in the cycle. □

Note  that  the  set  of  function  types  need  contain  only  those 
names  which  appear  in  a cycle  rather  than  all  function  types  which 
appear  in  the  reservation  table. 

Let us represent a row by the set of integers with subscripts,
where the integers are column numbers and subscripts are
function names. Thus for example a row $\{4_A, 2_B, 2_C, 7_A, 5_C\}$ means that A
appears in columns 4 and 7, B appears in column 2 and C appears in
columns 2 and 5 of that row.

Given a cycle with period p, compatibility is defined as
follows:

Two elements $i_X$ and $j_Y$, such that $i, j \in Z_p$ and X,Y ∈ the set
of function types, are compatible if for $j \ge i$, $(j-i) \in H_{XY} \bmod p$ (or if for
$j < i$, $(i-j) \in H_{YX} \bmod p$).

The following lemma follows from Property 2.10.3 and gives
a convenient way to test compatibility, which avoids the comparison of
i and j. This is a generalization of Lemma 2.9.1.1.

Lemma 2.10.2.1: Two elements $i_X$ and $j_Y$, such that $i, j \in Z_p$ and X,Y ∈
the set of function types, are compatible if and only if
$(j-i) \bmod p \in H_{XY} \bmod p$. □

The following are generalizations of Lemmas 2.9.1.2, 2.9.1.3,
Theorem 2.9.1, and Corollary 2.9.1.1 respectively.

Lemma 2.10.2.2: Given a pair of functions (X,Y) in columns $t_1$ and $t_2$,
respectively, of a row, the usage interval is allowable if and only if
$(t_2 - t_1) \bmod p \in H_{XY} \bmod p$. □


82 


I 

I 

Lemma 2.10.2.3: Given a set $\{i_X, j_Y, \ldots\}$ either all or none of the
following rows are allowed by a cycle with period p.

Row $\{[(i+k) \bmod p + i_1 p]_X, [(j+k) \bmod p + j_1 p]_Y, \ldots\}$

∀ integers $k, i_1, j_1, \ldots$ □

Theorem 2.10.2: Let $\{i_X, j_Y, \ldots\}$ be a compatibility class of a given
cycle with period p; then all of the following rows are allowable.

Row $\{[(i+k) \bmod p + i_1 p]_X, [(j+k) \bmod p + j_1 p]_Y, \ldots\}$

∀ integers $k, i_1, j_1, \ldots$ □

Corollary 2.10.2.1: Let $\{i_X, j_Y, \ldots\}$ be a compatibility class of a
given cycle with period p, then the following rows are allowable

(a) row $\{[i+k]_X, [j+k]_Y, \ldots\}$ ∀ integers k

(b) row $\{[(i+k) \bmod p]_X, [(j+k) \bmod p]_Y, \ldots\}$ ∀ integers k

(c) row $\{[i + i_1 p]_X, [j + j_1 p]_Y, \ldots\}$ ∀ integers $i_1, j_1, \ldots$ □

The following lemma (generalization of Lemma 2.9.2.1)
establishes the sufficiency of the procedure of Theorem 2.10.2 in
forming allowable rows.

Lemma 2.10.3.1: Given a cycle with period p and the set C consisting
of all its compatibility classes, the following rows are the only rows
which are allowed by the cycle.

Row $\{[i + i_1 p]_X, [j + j_1 p]_Y, \ldots\}$

∀ integers $i_1, j_1, \ldots$ and ∀ $\{i_X, j_Y, \ldots\} \in C$. □


Now we form the maximal compatibility classes from a given
cycle. The maximal compatibility classes are all we need to form any
allowable row. As before we need form only those classes which contain
at least one 0 element; i.e., $0_X$ for some function X. To generate
maximal compatibility classes we may use Algorithm 2.9.1 with slight
modification. First of all, the "greater than" relation should be
properly defined. Let us call $i_X$ greater than $j_Y$ if i > j. If i = j
then any fixed order can be imposed on the function types, e.g., we
might say that $i_A < i_B < i_C \ldots$, etc. Next, we must generate one tree
for each function type with roots $0_A$, $0_B$, $0_C$, ..., etc. To avoid
duplication, $0_X$ is precluded from appearing in any tree with $0_Y$ as a
root if $0_Y > 0_X$. The following example will illustrate these changes.

Example 2.10.2: Cycle $(1_A, 1_B, 2_A)$, period p = 4.

The sets of initiation intervals modulo p:

$G_{AA} \bmod 4$ = {0,1,3}
$G_{AB} \bmod 4$ = {2,3}
$G_{BA} \bmod 4$ = {1,2}
$G_{BB} \bmod 4$ = {0}

The sets of allowable usage intervals modulo p:

$H_{AA} \bmod 4$ = {2}
$H_{AB} \bmod 4$ = {0,1}
$H_{BA} \bmod 4$ = {0,3}
$H_{BB} \bmod 4$ = {1,2,3}

Two trees are generated with roots $0_A$ and $0_B$, using Algorithm 2.9.1.
Here we have assumed $i_A < i_B$. The trees are given in Figure 2.10.2.
Maximal compatibility classes containing $0_X$ are:

$\{0_A, 0_B, 1_B\}, \{0_A, 2_A\}, \{0_B, 1_B, 2_B, 3_B\}, \{0_B, 3_A, 3_B\}$.

□
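The classes of this example can be cross-checked by brute force instead of the tree procedure. The sketch below uses the compatibility test of Lemma 2.10.2.1 and enumerates subsets, so it is only suitable for very small periods. All names are hypothetical, and latencies are assumed to be measured from the preceding initiation.

```python
from itertools import combinations


def h_sets(cycle):
    """Return (p, H) where H[(X, Y)] is H_XY mod p for a list of (latency, fn)."""
    p = sum(latency for latency, _ in cycle)
    times, t = [], 0
    for latency, fn in cycle:
        t += latency
        times.append((t, fn))
    g = {}
    for tx, x in times:
        for ty, y in times:
            g.setdefault((x, y), set()).add((tx - ty) % p)
    return p, {pair: set(range(p)) - s for pair, s in g.items()}


def compatible(a, b, h, p):
    """Lemma 2.10.2.1: i_X and j_Y are compatible iff (j - i) mod p is in H_XY."""
    (i, x), (j, y) = a, b
    return (j - i) % p in h[(x, y)]


def maximal_classes_with_zero(cycle):
    p, h = h_sets(cycle)
    fns = sorted({fn for _, fn in cycle})
    elements = [(i, fn) for i in range(p) for fn in fns]
    classes = []
    for size in range(len(elements), 0, -1):  # largest subsets first
        for subset in combinations(elements, size):
            if not any(i == 0 for i, _ in subset):
                continue  # keep only classes holding some 0_X
            if all(compatible(a, b, h, p) for a, b in combinations(subset, 2)):
                s = set(subset)
                if not any(s <= found for found in classes):
                    classes.append(s)
    return classes


for cls in maximal_classes_with_zero([(1, "A"), (1, "B"), (2, "A")]):
    print(sorted(cls))
```

An element such as $3_A$ is represented here by the pair `(3, "A")`.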


[Figure 2.10.2. Generating the maximal compatibility classes of
cycle $(1_A, 1_B, 2_A)$. The trees are not reproduced here.]




A compatibility class $C_1$ is said to cover another class $C_2$ if
for each function, the number of elements of that function type in class
$C_1$ is greater than or equal to the number of elements of the same
function type in class $C_2$.

In the above example $\{0_A, 0_B, 1_B\}$ and $\{0_B, 3_A, 3_B\}$ cover each
other. We may apply the same definition for cover among rows and also
between a row and a compatibility class. Thus for example a
row $\{0_A, 2_A, 3_B, 4_C, 5_B\}$ covers a class $\{0_B, 2_A, 5_A\}$, but does not cover a
class $\{0_C, 4_C\}$.
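The cover relation only compares how many elements of each function type the two sets contain, which a counter expresses directly. The function name `covers` is hypothetical, and elements are written as (column, function) pairs.

```python
from collections import Counter


def covers(c1, c2):
    """True if c1 has at least as many elements of every function type as c2."""
    n1 = Counter(fn for _, fn in c1)
    n2 = Counter(fn for _, fn in c2)
    return all(n1[fn] >= count for fn, count in n2.items())


first = {(0, "A"), (0, "B"), (1, "B")}
second = {(0, "B"), (3, "A"), (3, "B")}
print(covers(first, second), covers(second, first))
```

The column numbers play no role in the test; only the per-function counts matter, which is why the two classes above cover each other.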

Theorem  2.10.3:  For  a given  cycle,  a multifunction  pipeline  can  be 

made  allowable  by  delaying  some  computation  steps  if  and  only  if  each 
row  of  the  reservation  table  is  covered  by  at  least  one  compatibility 
class  of  the  cycle. 

Proof : The  'only  if'  part  follows  directly  from  Lemma  2.10.3.1, 

because  every  allowable  row  of  a cycle  is  covered  by  at  least  some 
compatibility  class.  Thus  no  allowable  row  can  exist  which  is  not 
covered  by  some  compatibility  class  of  the  cycle. 

Now  the  rest  of  the  proof  is  similar  to  the  proof  of  Theorem 
2.9.5.  Each  row  can  be  made  to  look  like  one  of  the  rows  of  Lemma 
2.10.3.1  by  delaying  some  computation  steps.  This  can  always  be  done 
because  for  each  row  there  is  at  least  one  compatibility  class  which 
covers  it.  Q.E.D. 

Now  the  problem  of  making  a pipeline  allowable  can  be 
formulated  in  the  following  manner. 




First  of  all  we  need  to  generalize  the  objective  function, 
namely,  the  added  execution  delay.  Each  function  has  its  own  execution 
delay.  We  may  not  be  able  to  minimize  all  these  delays  simultaneously. 
Therefore  we  shall  minimize  some  function  of  the  execution  delays.  Let 
D(X)  denote  the  added  execution  delay  for  function  X.  Then  the 
objective  function  might  be  thought  of  as  some  function  of  D(X)'s  which 
is monotonically increasing in each D(X); e.g., some linear
combination of D(X)'s with positive coefficients.

Each function has its own set of elemental delays. The
variable $d_{ij}$, the number of elemental input delays inserted before
row i column j (defined in Section 2.6), needs a third subscript here
to indicate which function it is to delay. Thus $d_{ij}(X)$ shall refer to
$d_{ij}$ for function X. Now the added execution delay for each function
in a reservation table with I rows and J columns can be expressed as
below:


$$D(X) = \sum_{0 \le j < J} \; \max_{0 \le i < I} d_{ij}(X). \qquad (2.10.1)$$
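Equation (2.10.1) and a weighted objective of the kind mentioned above can be sketched as follows. The delay matrices and weights are made-up illustrative values, and the function names are hypothetical. Each matrix is assumed to be a list of rows with `d[i][j]` holding the number of elemental input delays before row i, column j.

```python
def added_execution_delay(d):
    """D(X): sum over columns j of the maximum over rows i of d[i][j]."""
    if not d:
        return 0
    return sum(max(column) for column in zip(*d))


def objective(delays_by_function, weights):
    """A linear combination of the D(X)'s with positive coefficients."""
    return sum(weights[fn] * added_execution_delay(d)
               for fn, d in delays_by_function.items())


# Illustrative values only: two functions on a table with 2 rows, 3 columns.
delays = {
    "A": [[0, 1, 0],
          [2, 0, 0]],
    "B": [[0, 0, 1],
          [0, 0, 3]],
}
weights = {"A": 1, "B": 2}
print(added_execution_delay(delays["A"]), objective(delays, weights))
```

Any other objective that increases monotonically in each D(X) could be substituted for the weighted sum.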


Next we form the constraints. The usage interval due to an X in cell
(a,b) and a Y in cell (a,c) for c > b can be written as

[(the distance of $Y_{ac}$ from column 0) - (the distance of $X_{ab}$
from column 0)].

Thus the complete set of constraints can be expressed from Lemma
2.10.2.2 in the following manner.

$$\Bigl[\Bigl(\sum_{0 \le j < c} \max_{0 \le i < I} d_{ij}(Y) + d_{ac}(Y) + c\Bigr) - \Bigl(\sum_{0 \le j < b} \max_{0 \le i < I} d_{ij}(X) + d_{ab}(X) + b\Bigr)\Bigr] \bmod p \in H_{XY} \bmod p$$

(or $\notin G_{XY} \bmod p$)

for each pair $(X_{ab}, Y_{ac})$, ordered arbitrarily.


The  algorithm  to  obtain  an  optimum  solution  is  the  same  as 
Algorithm  2.7.1. 


2.11.  Concluding  Remarks 


We  have  shown  that  under  certain  conditions  a pipeline  can  be 
modified  by  simply  inserting  noncompute  segments  so  that  the  pipeline 
allows  some  given  initiation  cycle.  In  general,  the  condition  which 
must  be  satisfied  is  that  every  row  of  the  reservation  table  must  be 
covered  by  some  compatibility  class  of  the  given  cycle.  In  particular, 
for  single  function  pipelines  we  saw  that  if  the  cycle  is  a constant 
latency  cycle  with  latency  greater  than  or  equal  to  the  lower  bound  of 
the  pipeline,  the  condition  is  always  satisfied.  Thus  it  is  always 
possible  to  add  delay  to  a pipeline  to  achieve  maximum  throughput  with 
constant  cycles.  For  nonconstant  latency  cycles  in  a single  function 
pipeline  the  lower  bound  of  the  pipeline  must  be  less  than  or  equal  to 
the  size  of  the  largest  compatibility  class  of  the  cycle,  which  in  turn 
may  be  less  than  the  average  latency  of  the  cycle.  For  multifunction 
pipelines,  we  use  the  general  condition. 




We presented the techniques for maximum sharing of noncompute
segments. However, the effect of such sharing on overall cost
is technology dependent and is not presented.

We  presented  a branch-and-bound  algorithm  to  find  an  optimum 
solution  (if  one  exists)  to  the  problem  of  making  the  pipeline  allowable 
for  a particular  cycle.  An  optimum  solution  is  defined  to  be  the  one 
which  uses  the  least  amount  of  added  execution  delay. 

We  do  not,  however,  have  an  algorithm  to  find  a globally 
optimum  solution;  that  is,  simply  given  an  average  initiation  latency 
(and  a function  mix  for  multifunction  pipelines)  find  a cycle  for  which 
the  added  execution  delay  is  a minimum  among  all  cycles  with  the  given 
average  latency  and  function  mix.  This  is  by  no  means  a simple  problem. 
There  are  infinitely  many  distinct  cycles  which  have  the  same  average 
latency  and  function  mix.  Only  a small  fraction  of  these  cycles  (but 
still  infinite  in  number)  can  be  made  allowable  for  a given  pipeline. 

We  know  very  little  about  cycle  structure.  Even  to  determine  whether 
a cycle  is  perfect  requires  cumbersome  procedures  of  forming  maximal 
compatibility  classes  except  for  a few  cases.  Furthermore,  in  general, 
the longer the period of the cycle, the larger the number of
compatibility classes (especially so for multifunction cycles). We can only
make a heuristic suggestion for choosing a 'good' cycle, namely,
choose a cycle with as small a period as possible and choose a cycle
which has a fairly regular pattern of latencies and functions. Thus,
for example, we would choose a cycle $(4_A, 4_B, 4_C)$ over a cycle
$(2_A, 5_B, 3_B, 1_C, 10_A, 3_C)$ with the same average latency and function mix.




This  choice  is  made  because  cycles  with  only  a few  distinct  latencies 
are  likely  to  have  small  G sets  and  hence  more  freedom  in  the  choice  of 
allowable  usage  intervals. 

Once  the  pipeline  has  been  modified  to  allow  some  desired 
cycle  how  do  we  accommodate  a different  arrival  pattern?  For  single 
function  units  this  is  not  a difficult  problem.  A buffer  at  the  input 
of  the  pipeline  is  all  we  need  to  change  any  arrival  pattern  into  a 
desired  allowable  cycle.  Guidelines  for  choosing  an  appropriate  size 
of  buffer  for  random  arrivals  and  constant  departures  are  discussed  in 
[DOR67] and [PHI70]. A buffer can also be used for multifunction
pipelines  to  convert  the  arrivals  into  a desired  initiation  cycle. 
However,  this  scheme  may  be  too  restrictive,  because  we  have  to  assume 
that  the  function  mix  remains  constant  on  average  and  that  the  order  of 
task  arrivals  is  unimportant,  so  the  tasks  can  be  shuffled  to  suit  the 
cycle  before  initiation.  These  restrictive  assumptions  can  be  relaxed 
if  we  use  a scheduling  strategy,  such  as  the  greedy  strategy,  which 
adapts  itself  to  changes  in  input  patterns,  but  may  possibly  have  to 
incur  reduced  throughput.  Thus,  even  though  we  have  been  able  to 
improve  the  throughput  by  use  of  noncompute  segments,  we  still  do  not 
have  enough  flexibility  in  initiations  when  a particular  initiation 
cycle  is  not  required  by  the  operating  environment.  We  hope  to 
achieve  high  throughput  and  flexibility,  when  required,  with  the  use 
of  internal  buffers  as  discussed  in  the  next  two  chapters. 




Chapter  3 

INTERNAL  BUFFERS  WITH  PERIODIC  ARRIVALS 
3.1.  Introduction 

The  allowable  initiation  sequences  of  a pipeline  are  a 
function  of  the  usage  intervals  of  each  segment.  Once  the  usage 
intervals  are  fixed,  we  have  only  a limited  choice  of  initiation 
sequences.  Since  this  limitation  may  restrict  the  flexibility  in 
scheduling,  we  may  think  of  changing  the  usage  intervals  dynamically, 
for operating in environments in which restricted initiation sequences
are  undesirable.  Instead  of  using  noncompute  segments  which  provide  a 
fixed  delay,  we  can  use  buffers  between  segments  to  provide  a variable 
delay.  Once  buffers  are  used,  it  is  convenient  to  use  them  to  simplify 
the  scheduling  problem  as  well.  We  shall  use  buffers  to  resolve 
collisions  in  the  following  sense.  Whenever  two  or  more  tasks  are 
trying  to  use  the  same  segment  at  the  same  time,  one  of  the  tasks  is 
allowed  to  use  the  segment  while  the  rest  of  the  tasks  are  stored  in 
the  buffer  to  wait  their  turn  according  to  some  priority  mechanism.  In 
this  sense  the  buffers  might  be  thought  of  as  adaptive  delay  elements. 

In  this  chapter  we  shall  examine  various  priority  schemes. 
Throughout  this  chapter  we  assume  that  arrivals  are  periodic.  Random 
arrivals  are  treated  in  the  next  chapter.  Unless  otherwise  specifically 
mentioned we also assume that no task uses more than one segment in one
time unit. This is to say that there is no parallel computation within
a single task. Justification for this assumption should be clear later
on when we discuss the implementation of priority schemes in practice.
As  before,  we  also  assume  that  each  computation  step  is  one  time  unit 
long. 


The following symbolism is used consistently in this chapter.

$p$ = the period of an initiation cycle.

$N$ = the set of function types present in the cycle.

$n_X$ = the number of task initiations of type $X$ in a cycle.

$n$ = the total number of task initiations in a cycle
$= \sum_{X \in N} n_X$.

$m_{iX}$ = the number of time units that segment $i$ is used for a task
of type $X$.

$T_i$ = task $i$, which is the $i$th arrival at the pipeline, where the
arrivals are numbered 0,1,2,... If 2 or more tasks of
different types arrive simultaneously then they are ordered
arbitrarily; this ordering, however, must be maintained every
period. We assume, though, that no more than one task of the
same type can arrive simultaneously.

$t_i$ = the instant in time after $t_i$ units of time have passed by.
We assume that any transfers, arrivals and departures of
tasks can only occur at these discrete time instants.

$(t_i, t_j)$ = the time interval between instants $t_i$ and $t_j$, which is
$(t_j - t_i)$ units long.

Given an initiation cycle and a pipeline, define the
workload of a segment to be the average busy time required of the
segment to process the tasks at the input rate:

$w_i$ = the workload of segment $i$ = $\left(\sum_{X \in N} n_X m_{iX}\right)/p$.

We  may  refer  to  the  maximum  workload  over  all  segments  as  the 
workload  of  the  pipeline.  A workload  of  100%  is  the  highest  load  which 
can  be  processed  by  a pipeline.  Thus  in  a single  function  pipeline,  a 
workload of 100% corresponds to average input latency being equal to the
lower  bound  latency  of  the  pipeline. 

Unless specifically mentioned, we shall assume that workload
is $\le$ 100%.

When  it  is  clear  that  the  pipeline  under  consideration  is  a 
single  function  pipeline,  we  shall  omit  the  subscript  X in  all  of  the 
above  symbols  for  simplicity. 

If  the  range  of  X is  not  specified  then  the  range  should  be 
taken  over  all  valid  function  types.  Specifically, 

$\sum_X f(X) = \sum_{X \in N} f(X)$,

and

$\max_X f(X) = \max_{X \in N} f(X)$.


Example 3.1.1: Consider the cycle $(2_A, 3_B, 0_A, 2_C, 5_B)$ and the
reservation table of Figure 3.1.1. Then

$p = 2 + 3 + 0 + 2 + 5 = 12$, $N = \{A, B, C\}$,

$n_A = 2$, $n_B = 2$, $n_C = 1$,

$m_{0A} = 3$, $m_{0B} = 0$, $m_{0C} = 2$, $w_0 = (6 + 0 + 2)/12 = 66.6\%$,

$m_{1A} = 2$, $m_{1B} = 3$, $m_{1C} = 1$, $w_1 = (4 + 6 + 1)/12 = 91.6\%$,

$m_{2A} = 1$, $m_{2B} = 1$, $m_{2C} = 2$, $w_2 = (2 + 2 + 2)/12 = 50.0\%$.

□
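The workload definition can be checked numerically against
Example 3.1.1. The following Python sketch is illustrative only: the
function name `workloads` and the dictionary layout are assumptions
made here, the cycle is read as $(2_A, 3_B, 0_A, 2_C, 5_B)$, and the
$m_{iX}$ values are copied from the example rather than derived from
Figure 3.1.1.

```python
from collections import Counter
from fractions import Fraction


def workloads(m, n, p):
    """Return w_i = (sum over X of n_X * m_iX) / p for every segment i.

    m -- list of dicts, m[i][X] = time units segment i is used per task of type X
    n -- dict, n[X] = number of initiations of type X in one cycle
    p -- period of the initiation cycle
    """
    return [Fraction(sum(n[x] * row.get(x, 0) for x in n), p) for row in m]


# The cycle written as (latency, function type) pairs.
cycle = [(2, "A"), (3, "B"), (0, "A"), (2, "C"), (5, "B")]
p = sum(latency for latency, _ in cycle)      # period of the cycle
n = Counter(ftype for _, ftype in cycle)      # initiations per function type

# m_iX values as listed in Example 3.1.1 (rows are segments 0, 1, 2).
m = [
    {"A": 3, "B": 0, "C": 2},
    {"A": 2, "B": 3, "C": 1},
    {"A": 1, "B": 1, "C": 2},
]

w = workloads(m, n, p)
pipeline_workload = max(w)   # workload of the pipeline: maximum over segments

for i, w_i in enumerate(w):
    print(f"w_{i} = {w_i} = {float(w_i):.1%}")
```

Exact fractions are used so that the comparison with 100% (a workload
of exactly 1) is not affected by rounding.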




3.2.  Priority  Schemes 

The  priority  schemes  that  come  to  mind  initially  are  First-In- 
First-Out  and  Last-In-First-Out.  Therefore  we  start  our  discussion 
with  these  two  priorities. 

FIFO-global: A task $T_i$ has a priority over all tasks
$T_{i+k}$ ($k > 0$), which arrive later.

LIFO-global: A task $T_i$ has a priority over all tasks
$T_{i-k}$ ($k > 0$), which arrive earlier.

As  an  example  take  the  reservation  table  of  Figure  3.2.1a. 

A schematic  of  this  pipeline  without  buffers  is  given  in  Figure  3.2.1b. 
Implementation  of  buffers  with  priority  FIFO-global  or  LIFO-global  can 
be  represented  as  in  Figure  3.2.1c. 

We should stress here that the buffers in Figure 3.2.1c
themselves are not FIFO queues or LIFO stacks. The actual
implementation of the priority schemes FIFO-global and LIFO-global is
more complex than the impression we get from Figure 3.2.1c. One
possible implementation is described below.

Each  task  carries  a tag  which  is  initialized  to  zero  at  its 
arrival.  At  every  time  unit  the  tag  is  incremented  by  one.  The  tag 
number  now  uniquely  defines  the  priority.  In  the  case  of  FIFO-global, 
the  priority  increases  with  the  value  of  the  tag  while  in  the  case  of 
LIFO-global  a higher  valued  tag  implies  a lower  priority  for  that  task. 
Assuming the buffers always have the tasks ordered according to their
priorities, every task that arrives at a buffer must be merged into
the buffer according to the value of its tag; this by itself is a
complex operation. When a task uses some segment more than once,
it must also carry some information about its state of computation.

The  computation  state  or  the  processing  state  of  a task  is  defined  to 
be  the  number  of  computation  steps  the  task  has  gone  through.  This 
information  is  required  to  control  the  flow  between  segments  and  to 
allow  the  actual  processing  done  by  a segment  to  differ  for  different 
computation  steps.  One  more  tag  is  required  for  each  task  in  a 
multifunction  pipeline  to  identify  the  function  type  of  each  task. 
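The tag mechanism just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
names `Task`, `tick` and `select` are invented here, and the sketch
scans an unordered buffer for the best tag instead of keeping the
buffer sorted by merging, so it is a simplification and not the
implementation described above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    index: int        # i of task T_i, kept only for display
    ftype: str = "A"  # function type tag (needed in multifunction pipelines)
    tag: int = 0      # time units elapsed since arrival at the pipeline
    state: int = 0    # number of computation steps completed so far


def tick(tasks):
    """Advance time by one unit: every task's tag is incremented by one."""
    for task in tasks:
        task.tag += 1


def select(buffer, scheme):
    """Remove and return the highest-priority task of a buffer (None if empty)."""
    if not buffer:
        return None
    if scheme == "FIFO-global":
        chosen = max(buffer, key=lambda task: task.tag)   # largest tag wins
    elif scheme == "LIFO-global":
        chosen = min(buffer, key=lambda task: task.tag)   # smallest tag wins
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    buffer.remove(chosen)
    return chosen


# Three tasks waiting in one buffer; T_0 arrived first, T_1 last.
waiting = [Task(0, tag=5), Task(1, tag=2), Task(2, tag=3)]
print(select(list(waiting), "FIFO-global").index)
print(select(list(waiting), "LIFO-global").index)
```

Tasks that arrive simultaneously carry equal tags; in this sketch the
tie is broken by position in the buffer, which is one way of fixing
the arbitrary but consistent ordering that the definitions require.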

We  shall  show  later  that  FIFO-global  and  LIFO-global 
strategies  can  be  implemented  in  a much  simpler  way  if  the  pipeline 
is  a single  function  unit.  For  multifunction  pipelines,  however,  we 
do  not  have  a simpler  implementation,  and  we  must  look  for  other 
priority  schemes  which  are  simpler  to  implement. 

To  visualize  the  task  flow  it  is  useful  to  think  of  three 
distinct  steps  at  each  time  instant.  First,  the  tasks  arrive  in  the 
buffers  from  segments  or  from  outside.  Second,  the  priority  is  computed 
and  tasks  with  the  highest  priority  in  each  buffer  are  selected.  Third, 
the  tasks  thus  selected  leave  the  buffers  for  segments.  For  practical 
implementation,  the  priority  computation  can  be  done  in  parallel  with 
the  segment  execution  step,  since  the  task  flows  are  deterministic 
(i.e.,  the  priority  computation  has  access  to  the  tasks  in  the  buffer  as 
well as those about to arrive). Therefore, as soon as the tasks leave
the segments and arrive at the buffers, the highest priority tasks can
be selected and sent out. Thus the only additional time required is
the transfer time, namely, time to place a task in or remove a task
from a buffer. This time can be added to the time to perform a single
computation step. Thus for theoretical purposes, one can assume that
the above three steps are being performed in zero time at each time
instant.

Throughout  this  chapter  when  we  say  "a  task  leaves  a buffer 
after time t" we mean that task leaves at time instant t+1 or t+2, etc.;
"a task arrives at a buffer no earlier than t" means that the task
does not arrive at t-1 or t-2, etc., but it may arrive at t or t+1 or
t+2, etc.

As  an  illustration  of  FIFO-global  strategy,  the  resulting 
flow  pattern  in  the  single  function  pipeline  of  Figure  3.2.2a  with 
initiation cycle (1,2,3) is given in Figure 3.2.2b. $B_0$, $B_1$ and $B_2$
are respectively the buffers at the input of segments $S_0$, $S_1$ and $S_2$.

The  time  instants  at  which  a task  arrives  at  or  leaves  the  pipeline  are 
indicated  by  i . Each  unit  time  interval  is  marked  with  the  preceding 
time instant. A number $i_j$ in a segment row indicates that task $T_i$ is
being processed for its $j$th computation step during the time interval
marked on the column. And $i_j$ in a buffer row indicates that task $T_i$
has completed $j$ computation steps and is waiting for its $(j+1)$th
computation step.

Looking  closely  at  the  Figure  3.2.2b  one  can  notice  that  the 
usage  pattern  of  segments  and  buffers  becomes  periodic  after  a small 
transient.  To  see  this  consider  the  columns  with  labels  10  and  22. 


Suppose  we  relabel  the  index  of  each  task  by  (index  - 6)  in  column 
22  and  onwards.  Since  the  priority  depends  only  on  the  relative 
magnitude  of  the  indices,  subtracting  a constant  from  all  indices 
of  competing  tasks  does  not  affect  the  original  priority.  If  this 
relabeling  is  done  then  one  can  easily  see  that  the  usage  pattern  of 
segments  and  buffers  in  column  22  and  beyond  it  is  identical  to  the 
usage  pattern  in  column  10  and  beyond  in  the  original  flow  chart.  This 
means  that  the  pipeline  usage  pattern  is  periodic  with  period  12.  One 
can  also  see  from  the  flow  chart  that  the  buffers  are  bounded  and  that 
the  average  output  rate  of  the  pipeline  is  equal  to  the  average  input 
rate, namely, one task every 2 time units.

Recall  from  Section  2.9  that  cycle  (1,2,3)  cannot  be  allowed 
by  any  pipeline  with  lower  bound  latency  2.  However,  by  employing 
internal  buffers  we  have  been  able  to  utilize  the  cycle  (1,2,3)  on  a 
pipeline  with  lower  bound  2. 

Provided that workload remains $\le$ 100%, one can conjecture
that for periodic initiations in a finite pipeline with FIFO-global
priority, the usage pattern eventually becomes periodic. This conjecture
follows immediately if one can prove that the queues remain bounded in
length. If a queue never goes to zero then we can show that it remains
bounded: if the queue is never empty, the segment associated
with that queue is busy 100% of the time. By assumption, the workload
on any segment is $\le$ 100%, therefore the queue cannot grow indefinitely
and it must eventually reach a finite upper bound in finite time.
However, this and similar other reasoning employing the segment
utilization and workload arguments have failed to disprove the following
hypothetical situation.

Consider  a queue  which  reaches  some  peak  value  and  then 
returns  to  zero.  It  then  grows  again  to  another  peak  value  higher  than 
the  previous  peak  and  then  comes  back  to  zero.  If  this  continues  (see 
Figure  3.2.3)  then  no  periodic  pattern  is  achieved  and  the  queue  does 
not  remain  bounded.  However,  such  a situation  has  never  arisen  in  any 
of  the  examples  we  have  tried.  For  specific  pipelines  and  cycles  we 
have  been  able  to  apply  some  detailed  arguments  to  show  that  the 
queues  remain  bounded.  However,  these  arguments  are  too  specific  to 
be  generalized.  Moreover,  the  details  of  the  arguments  almost  amount 
to  writing  out  the  flow  of  tasks  such  as  in  Figure  3.2.2b. 

Even  if  the  conjecture  that  the  queues  remain  bounded  is 
accepted,  the  question  still  remains  regarding  the  magnitude  of  bounds 
on  queue  size  and  wait  time.  As  yet,  our  intuition  has  not  yielded 
any reasonable guess regarding the bounds. Thus, FIFO-global priority
still remains unsolved. LIFO-global priority, however, is completely
solved as discussed in the following section.


3.3.  Bounds  on  Queue  Size  and  Wait  Time  in  LIFO-global  Priority 

As defined earlier, in LIFO-global a task $T_i$ has a priority
over all earlier tasks $T_{i-k}$. An example of a usage pattern in a
LIFO-global pipeline is given in Figure 3.3.1b. The initiation cycle
used is cycle (1,2,3). Notice that any two tasks $T_i$ and $T_{i+n}$
($n = 3$ in this example) have identical usage patterns. This is not a
sheer coincidence, as we shall see in the theorem soon to follow. It is
precisely due to this property that LIFO-global lends itself to a clean
theoretical treatment. Unless otherwise specifically mentioned, the
word pipeline refers to both single and multifunction pipelines in
the remainder of this chapter.

[Figure 3.3.1. Flow of tasks in a buffered pipeline with LIFO-global
priority. (a) Reservation table (rows $S_0$, $S_1$, $S_2$; time columns
0 to 4). (b) Flow in pipeline (a) with cycle (1,2,3) (rows $S_0$, $S_1$,
$S_2$, $B_0$, $B_1$, $B_2$; time columns 0 to 20).]

Theorem 3.3.1: In a buffered pipeline with LIFO-global priority, tasks
$T_i$ and $T_{i+n}$ have the same usage pattern, where $n$ is the number
of task initiations in a cycle.

Proof: By definition of LIFO-global, $T_i$ has a priority over $T_{i-k}$
for all $k > 0$. Thus the usage pattern of $T_i$ is unaffected by any
previously initiated tasks $T_{i-k}$ and is governed solely by future
initiations. The initiations are periodic and infinite, therefore tasks
$T_i$ and $T_{i+n}$ see identical future initiation sequences. Thus $T_i$
and $T_{i+n}$ have identical usage patterns. Q.E.D.

We  did  not  use  our  assumption  on  the  workload  in  the  above 
proof  and  thus  the  theorem  is  valid  for  any  workload. 
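Theorem 3.3.1 can be checked numerically with a small simulation of
the three-step model of Section 3.2 (arrive in buffers, select by
priority, leave for segments). The sketch below is an illustration
under stated assumptions, not the author's program: the function
names are invented, the pipeline is single function and described by a
`route` (the segment used at each computation step), and the route
(0, 1, 2, 0, 2) is a hypothetical table with $m_0 = 2$, $m_1 = 1$,
$m_2 = 2$, not necessarily that of Figure 3.3.1a. Task $T_0$ is
assumed to arrive at time 0.

```python
def simulate(route, cycle, num_tasks, scheme="LIFO"):
    """Simulate a buffered single-function pipeline.

    route     -- route[k] is the segment used by computation step k
    cycle     -- initiation cycle as a tuple of latencies, e.g. (1, 2, 3)
    num_tasks -- number of tasks T_0 .. T_{num_tasks-1} to initiate
    scheme    -- "LIFO" (LIFO-global) or "FIFO" (FIFO-global)

    Returns (arrivals, usage); usage[i] lists (t, segment) pairs meaning
    that the segment processes T_i during the interval (t, t+1).
    """
    if scheme not in ("LIFO", "FIFO"):
        raise ValueError(f"unknown scheme: {scheme}")
    arrivals, t = [], 0
    for i in range(num_tasks):
        arrivals.append(t)
        t += cycle[i % len(cycle)]

    num_segments = max(route) + 1
    buffers = [[] for _ in range(num_segments)]   # entries are (task, step)
    in_segment = [None] * num_segments            # task processed during (t-1, t)
    usage = [[] for _ in range(num_tasks)]
    pick = max if scheme == "LIFO" else min       # priority by task index
    next_task, finished, t = 0, 0, 0

    while finished < num_tasks:
        # Step 1: tasks arrive in the buffers from segments or from outside.
        for s in range(num_segments):
            if in_segment[s] is not None:
                task, step = in_segment[s]
                in_segment[s] = None
                if step + 1 < len(route):
                    buffers[route[step + 1]].append((task, step + 1))
                else:
                    finished += 1
        while next_task < num_tasks and arrivals[next_task] == t:
            buffers[route[0]].append((next_task, 0))
            next_task += 1
        # Steps 2 and 3: the highest-priority task of each buffer is
        # selected and leaves the buffer for its segment.
        for s in range(num_segments):
            if buffers[s]:
                entry = pick(buffers[s])
                buffers[s].remove(entry)
                in_segment[s] = entry
                usage[entry[0]].append((t, s))
        t += 1
    return arrivals, usage


def count_matching_pairs(arrivals, usage, n, p):
    """Count pairs (T_i, T_{i+n}) whose usage patterns agree up to a shift of p.

    Only pairs whose later task starts its last step before the last
    simulated arrival are compared, so that cutting off the infinite
    arrival stream cannot influence them. Returns (checked, matched).
    """
    horizon = arrivals[-1]
    checked = matched = 0
    for i in range(len(usage) - n):
        if usage[i + n][-1][0] < horizon:
            checked += 1
            shifted = [(t + p, s) for t, s in usage[i]]
            matched += shifted == usage[i + n]
    return checked, matched


route = (0, 1, 2, 0, 2)   # hypothetical reservation table, one X per column
cycle = (1, 2, 3)
arrivals, usage = simulate(route, cycle, num_tasks=60, scheme="LIFO")
checked, matched = count_matching_pairs(arrivals, usage, n=len(cycle), p=sum(cycle))
print(f"{matched} of {checked} compared pairs have identical shifted patterns")
```

Running the same comparison with `scheme="FIFO"` is a convenient way
to observe the transient behavior discussed in Section 3.2, where
no such task-by-task repetition is guaranteed.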

The  following  corollary  is  merely  a consequence  of  two  usage 
patterns  being  identical.  It  is  also  valid  for  any  workload. 

Corollary 3.3.1.1:

1. $T_i$ arrives at segment $s$ for its $k$th computation step at time
$t$ $\Rightarrow$ $T_{i+jn}$ arrives at segment $s$ for its $k$th step at
time $t+jp$, $\forall j \ge 0$.

2. $T_i$ is processed by segment $s$ for its $k$th computation step during
$(t, t+1)$ $\Rightarrow$ $T_{i+jn}$ is processed by segment $s$ for its
$k$th step during $(t+jp, t+jp+1)$, $\forall j \ge 0$.

3. Segment $s$ is busy during $(t, t+1)$ $\Rightarrow$ segment $s$ is busy
during $(t+jp, t+jp+1)$, $\forall j \ge 0$. □

The  following  lemmas  lead  to  the  establishment  of  queue  bounds 
as  well  as  bounds  on  wait  times. 

Lemma 3.3.2.1: In a buffered pipeline with LIFO-global priority, a
segment $i$ does not receive more than $\sum_X m_{iX} n_X$ tasks during
any interval $(t, t+p)$, $t \ge 0$.

Proof: Suppose some segment $i$ receives more than $\sum_X m_{iX} n_X$
tasks during some interval $(t, t+p)$. Then by (1) of Corollary 3.3.1.1,
the segment $i$ also receives more than $\sum_X m_{iX} n_X$ tasks during
$(t+p, t+2p)$, $(t+2p, t+3p)$, and so on. This implies that on average,
segment $i$ is subjected to a workload greater than
$\frac{1}{p} \sum_X m_{iX} n_X$. However, the input rate implies a
workload no greater than $\frac{1}{p} \sum_X m_{iX} n_X$. Thus we have a
contradiction and hence the segment $i$ does not receive more than
$\sum_X m_{iX} n_X$ tasks during any interval $(t, t+p)$. Q.E.D.


Lemma 3.3.2.2: In a buffered pipeline with LIFO-global priority, a
segment $i$ cannot be busy for more than $\sum_X m_{iX} n_X$ time units
during any interval $(t, t+p)$, $t \ge 0$.

Proof: Suppose segment $i$ is busy for more than $\sum_X m_{iX} n_X$
units during some interval $(t, t+p)$. Then by (3) of Corollary 3.3.1.1,
the segment $i$ is also busy for more than $\sum_X m_{iX} n_X$ units
during intervals $(t+p, t+2p)$, $(t+2p, t+3p)$, and so on. This implies
that segment $i$, on the average, remains busy for more than
$\frac{1}{p} \sum_X m_{iX} n_X$ of the time. This is impossible
because the workload of the segment is $\frac{1}{p} \sum_X m_{iX} n_X$.
Thus even if segment $i$ processes the tasks at the input rate it cannot,
on the average, remain busy for more than $\frac{1}{p} \sum_X m_{iX} n_X$
of the time. Thus segment $i$ does not remain busy for more than
$\sum_X m_{iX} n_X$ units during any interval which is $p$ units long.
Q.E.D.

Lemma 3.3.2.3: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, two equally processed tasks $T_i$ and $T_{i+jn}$
($j > 0$) cannot be in the same buffer at the same time.

Proof: Suppose a buffer does have two equally processed tasks $T_i$ and
$T_{i+jn}$ for some $j \ge 1$. By definition, $T_{i+jn}$ has priority over
$T_i$. Thus if either $T_i$ or $T_{i+jn}$ ever leaves the buffer,
$T_{i+jn}$ would leave earlier than $T_i$. However, this violates
Theorem 3.3.1, namely, that $T_{i+jn}$ must follow $T_i$ exactly $jp$
time units later. Therefore the only other possibility is that neither
$T_i$ nor $T_{i+jn}$ ever leaves the buffer. By Theorem 3.3.1 all such
tasks $T_{i+kn}$ (for all $k \ge 0$) would accumulate in the same buffer.
This in turn implies that the segment associated with this buffer is
busy 100% of the time and the number of tasks in the buffer is growing
indefinitely. This is impossible because of the assumption that
workload is $\le$ 100%. In other words, if a segment remained busy 100%
of the time the queue at that segment cannot grow indefinitely. Thus no
two equally processed tasks $T_i$ and $T_{i+jn}$ can reside in the same
buffer at the same time. Q.E.D.

One can immediately establish an upper bound on queue length
from the above lemma. This is because the number of computation states
of a task is finite and the number of distinct task-subscripts modulo $n$
is finite.

Thus for example in a single function pipeline a task which
arrives at a segment $i$ can be in one of $m_i$ computation states, where
$m_i$ as defined earlier is the number of times segment $i$ is used for a
single task. Thus, by the above lemma there can be at most $m_i n$ tasks
in the buffer at segment $i$. By taking into account some additional
considerations we shall present a somewhat tighter upper bound later,
but first we prove some more lemmas.

Lemma 3.3.2.4: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, a task waits at most $(p-1)$ time units for a single
computation step.

Proof: From the corollary to Theorem 3.3.1, if a task $T_i$ arrives at a
segment for its $k$th computation step at time $t$ then a task $T_{i+n}$
arrives at the same segment for its $k$th computation step at time $t+p$.
If $T_i$ did not leave the buffer before $t+p$ then the two equally
processed tasks $T_i$ and $T_{i+n}$ would reside together in the same
buffer, a contradiction to Lemma 3.3.2.3. Thus $T_i$ must leave the
buffer before time $t+p$, that is, it waits no more than $(p-1)$ time
units. Q.E.D.

The  above  lemma  is  our  assurance  that  all  the  tasks  will  be 
processed.  This  assurance  is  necessary  because  intuition  leads  us  to 
believe  that  in  the  last-in-first-out  type  of  priority  there  is  a 
possibility  that  a particular  task  may  wait  forever  in  some  buffer. 

The bound of Lemma 3.3.2.4 can be tightened as shown in the following
lemma.

Lemma 3.3.2.5: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, a task waits no more than
$\left(\sum_X m_{iX} n_X\right) - 1$ time units at segment $i$ for any
single computation step.

Proof: By assumption the workload of any segment $i$ is $\le$ 100%, that
is, $\sum_X m_{iX} n_X \le p$. If $\sum_X m_{iX} n_X = p$ then we have
already proved the above statement in Lemma 3.3.2.4. Therefore it
remains to be shown that the statement is also true when
$\sum_X m_{iX} n_X < p$.

Suppose a task $T_k$ arrives at segment $i$ at time $t$ and waits
for more than $\left(\sum_X m_{iX} n_X\right) - 1$ time units, that is,
it waits for at least $\sum_X m_{iX} n_X$ time units. This implies that
the segment $i$ was busy during the entire interval
$(t, t + \sum_X m_{iX} n_X)$. Now by Lemma 3.3.2.4, $T_k$ must be
accepted by segment $i$ before time $t+p$. Thus the segment is busy for
at least $\left(\sum_X m_{iX} n_X\right) + 1$ time units during
$(t, t+p)$. This is a contradiction to Lemma 3.3.2.2. Thus $T_k$ does
not wait more than $\left(\sum_X m_{iX} n_X\right) - 1$ units at
segment $i$ for any one computation step. Q.E.D.

Lemma 3.3.2.6: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, the buffer at segment $i$ is empty during at least
one unit interval in the interval $(t, t + \sum_X m_{iX} n_X)$ for all
$t \ge 0$.

Proof: Again by the workload assumption, $\sum_X m_{iX} n_X \le p$.

(1) Let $\sum_X m_{iX} n_X < p$. A nonempty buffer for
$\sum_X m_{iX} n_X$ successive time units implies that its segment is
busy for at least $\left(\sum_X m_{iX} n_X\right) + 1$ successive time
units, that is, the segment $i$ remains busy for
$\left(\sum_X m_{iX} n_X\right) + 1$ units during some interval
$(t, t+p)$. This is a contradiction to Lemma 3.3.2.2. Therefore the
buffer must remain empty for at least one time unit during any interval
$(t, t + \sum_X m_{iX} n_X)$.

(2) Let $\sum_X m_{iX} n_X = p$. Suppose that the buffer at segment $i$
remains nonempty throughout the interval $(t, t+p)$ for some $t$. Let
the $p$ tasks processed successively by segment $i$ during that interval
be $T_{i_1}, T_{i_2}, \ldots, T_{i_p}$ respectively.

Consider some task $T_{j_1}$ which is in the buffer during $(t, t+1)$.
$T_{j_1}$ must be of lower priority than $T_{i_1}$, that is,
$j_1 < i_1$. Now either $T_{j_1}$ remains in the buffer throughout
$(t, t+p)$ or it is one of the tasks processed during $(t+1, t+p)$, that
is, $j_1 = i_{k'}$ for some $k' = 2, 3, \ldots, p$. If $T_{j_1}$ remains
in the buffer throughout $(t, t+p)$, then $j_1 < i_k$ for all
$k = 1, 2, \ldots, p$. Alternatively, if $j_1 = i_{k'}$ for some
$k' = 2, \ldots, p$, then $j_1 < i_k$ for all $k = 1, 2, \ldots, k'-1$.
Consider some task $T_{j_2}$ in the buffer during $(t+k'-1, t+k')$. We
must have $j_2 < i_{k'} = j_1$, that is, $j_2 < i_k$ for all
$k = 1, 2, \ldots, k'$. Similarly, either $T_{j_2}$ remains in the
buffer throughout $(t+k'-1, t+p)$ or $j_2 = i_{k''}$ for some
$k'' = k'+1, k'+2, \ldots, p$. Continuing in this fashion, by finite
induction, we can show that there must be a task $T_{j_k}$ which is in
the buffer during $(t+p-1, t+p)$ with $j_k < i_1, i_2, \ldots, i_p$. By
Theorem 3.3.1, tasks $T_{i_1+n}, \ldots, T_{i_p+n}$ will be processed by
segment $i$ during $(t+p, t+2p)$ etc., implying that task $T_{j_k}$
never leaves the buffer. This is a contradiction to Lemma 3.3.2.4,
namely that no task waits in a buffer for more than $(p-1)$ time units.
Therefore, the supposition that the buffer remains nonempty during
$(t, t+p)$ is false. Q.E.D.


Theorem 3.3.2: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, the maximum number of tasks in the buffer at
segment $i$, during any unit time interval is

$\le \sum_X m_{iX} n_X - \max_X (n_X)$.

Proof: By Lemma 3.3.2.6, the buffer at segment $i$ is empty during at
least one time unit in the interval $(t, t+p)$, $\forall t \ge 0$. Since
the behavior of the pipeline is periodic with period $p$, the buffer
becomes empty every $p$ time units. Consider a unit interval
$(t'-1, t')$ during which the buffer is empty. By Lemma 3.3.2.1 at most
$\sum_X m_{iX} n_X$ tasks arrive at the buffer during an interval
$(t, t+p)$, $\forall t \ge 0$. Therefore, at most $\sum_X m_{iX} n_X$
tasks arrive in the interval $(t', t'+p)$. The worst case accumulation
of tasks occurs in the buffer when these $\sum_X m_{iX} n_X$ tasks arrive
at the maximum possible rate. During any unit interval $(t, t+1)$ no
more than $m_{iX}$ tasks of type $X$ can arrive at segment $i$. Thus it
takes at least $n_X$ time units for $m_{iX} n_X$ tasks of type $X$ to
arrive at the buffer of segment $i$. Thus the shortest interval in which
all these $\sum_X m_{iX} n_X$ tasks can arrive is
$(t', t' + \max_X (n_X))$. The segment $i$ processes one task every time
unit. Thus $\max_X (n_X)$ tasks leave the buffer during the interval
$(t', t' + \max_X (n_X))$, implying that the largest accumulation that
could have occurred in the buffer is

$\sum_X m_{iX} n_X - \max_X (n_X)$ tasks. Q.E.D.

Corollary 3.3.2.1: In a buffered single function pipeline with
LIFO-global priority and workload $\le$ 100%, the maximum number of
tasks in the buffer at segment $i$, during any unit time interval is

$\le m_i n - n$. □

For the example in Figure 3.3.1a and cycle (1,2,3) the maximum
sizes of buffers from the above corollary are: $B_0 \le 3$, $B_1 \le 0$
and $B_2 \le 3$. From Figure 3.3.1b, the actual sizes are: $B_0 = 1$,
$B_1 = 0$ and $B_2 = 1$.
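The queue bound is simple enough to evaluate mechanically. The Python
sketch below is an illustration with invented names (`queue_bound`
and its dictionary arguments). One assumption is made beyond the
statement of Theorem 3.3.2: the maximum is taken only over function
types that actually use the segment, which follows the arrival-rate
argument of the proof and keeps the bound from going negative when
some type never visits the segment. The usage counts $m_0 = 2$,
$m_1 = 1$, $m_2 = 2$ are those implied by the bounds 3, 0 and 3 quoted
above with $n = 3$.

```python
def queue_bound(m_i, n):
    """Bound of Theorem 3.3.2 on the number of tasks waiting at one segment.

    m_i -- dict, m_i[X] = time units the segment is used per task of type X
    n   -- dict, n[X]   = number of initiations of type X per cycle
    """
    # Assumption: only function types that use this segment enter the max.
    users = [x for x in n if m_i.get(x, 0) > 0]
    if not users:
        return 0
    total = sum(m_i[x] * n[x] for x in users)
    return total - max(n[x] for x in users)


# Single function case of Corollary 3.3.2.1 with cycle (1,2,3), so n = 3.
n = {"A": 3}
bounds = [queue_bound({"A": m_i}, n) for m_i in (2, 1, 2)]
print(bounds)   # bounds for B_0, B_1, B_2
```

For a single function type the function reduces to $m_i n - n$, the
expression of the corollary.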

Theorem 3.3.3: In a buffered pipeline with LIFO-global priority and
workload $\le$ 100%, the wait times at segment $i$ satisfy the following
bounds:

1. The maximum wait time of any task for any one computation step

$\le \sum_X m_{iX} n_X - 1$.

2. The expected wait time of a task for a computation step

$\le \frac{1}{2} \sum_X m_{iX} n_X - \frac{1}{2} \, \frac{\sum_X m_{iX} n_X^2}{\sum_X m_{iX} n_X}$.

Proof: 1. Already proven in Lemma 3.3.2.5.

2. The behavior of the pipeline is periodic with period $p$. Moreover
by Lemma 3.3.2.6 the buffer becomes empty at least once during each
interval $(t, t+p)$. Thus we can find the bound on the expected wait
time by considering the maximum total wait time of the tasks at
segment $i$, in an interval $(t, t+p)$ and averaging over the total
number of computation steps in that interval.

The total wait time of tasks in $(t, t+p)$ at segment $i$ = (sum
of the departure times from the buffer at segment $i$) - (sum of the
arrival times at segment $i$).

To get the worst case wait time, we assume that arrivals
occur as early as possible. Consider a unit interval $(t-1, t)$ when the
buffer is empty. By Lemma 3.3.2.1, a maximum of $\sum_X m_{iX} n_X$
tasks arrive during $(t, t+p)$. The arrivals occur with the restriction
that no more than $m_{iX}$ tasks of type $X$ arrive during any one time
unit and that a total of $m_{iX} n_X$ tasks of type $X$ arrive during
$(t, t+p)$. Thus by the worst case assumption $m_{iX}$ tasks of type $X$
arrive every time unit starting at $t$, for $n_X$ consecutive time units.
Thus the sum of arrival times of tasks of type $X$ in the interval
$(t, t+p)$ is,

$\sum_{0 \le j < n_X} (t+j)\, m_{iX} = m_{iX} n_X t + \frac{1}{2} m_{iX} n_X (n_X - 1)$.

Then the sum of all arrival times at segment $i$ is,

$\sum_X \left[ m_{iX} n_X t + \frac{1}{2} m_{iX} n_X (n_X - 1) \right]$.   (a)

The segment processes one task every time unit. Thus one task departs
from the buffer every time unit and therefore the sum of the departure
times of $\sum_X m_{iX} n_X$ tasks is

$\sum_{0 \le j < \sum_X m_{iX} n_X} (t+j) = \sum_X m_{iX} n_X t + \frac{1}{2} \left(\sum_X m_{iX} n_X\right) \left(\left(\sum_X m_{iX} n_X\right) - 1\right)$.   (b)

Then the sum of the wait times in the worst case is (b)-(a)

$= \frac{1}{2} \left(\sum_X m_{iX} n_X\right)^2 - \frac{1}{2} \sum_X m_{iX} n_X^2$.

This quantity averaged over the number of computation steps
$\sum_X m_{iX} n_X$ gives the expected wait time for a computation step.
Q.E.D.


Corollary 3.3.3.1: In a buffered single function pipeline with
LIFO-global priority and workload $\le$ 100%, the wait times at
segment $i$ satisfy the following bounds:

1. The maximum wait time of any task for any one computation step at
segment $i$ is,

$\le m_i n - 1$.

2. The expected wait time of a task for a computation step at segment
$i$ is,

$\le \frac{1}{2} (m_i n - n)$. □

For  the  example  in  Figure  3.3.1a  and  cycle  (1,2,3)  the 
maximum  wait  time  in  the  buffer  at  segment  is  < 5 and  the  maximum 
expected  wait  time  < 3/2.  The  actual  values  for  this  example,  from 
Figure  3.3.1b,  are  2 and  1/3  respectively. 
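The two wait-time bounds of Theorem 3.3.3 can be evaluated with the
short Python sketch below. The function name `wait_bounds` and its
dictionary arguments are assumptions made for illustration; the
formulas are those of the theorem, and the single function case
reduces to Corollary 3.3.3.1. The value $m_i = 2$ used in the example
is the usage count that gives the maximum-wait bound of 5 quoted above
with $n = 3$.

```python
from fractions import Fraction


def wait_bounds(m_i, n):
    """Return (max wait bound, expected wait bound) at one segment.

    m_i -- dict, m_i[X] = time units the segment is used per task of type X
    n   -- dict, n[X]   = number of initiations of type X per cycle
    """
    total = sum(m_i.get(x, 0) * n[x] for x in n)            # sum of m_iX * n_X
    if total == 0:
        raise ValueError("the segment is not used by any task")
    squares = sum(m_i.get(x, 0) * n[x] ** 2 for x in n)     # sum of m_iX * n_X^2
    max_wait = total - 1
    expected_wait = Fraction(total, 2) - Fraction(squares, 2 * total)
    return max_wait, expected_wait


# Single function pipeline, cycle (1,2,3): n = 3, segment used twice per task.
max_wait, expected_wait = wait_bounds({"A": 2}, {"A": 3})
print(max_wait, expected_wait)
```

Exact fractions avoid rounding when the result is compared with
measured averages such as the value 1/3 read from Figure 3.3.1b.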

In the majority of cases the actual queue sizes and wait times
remain considerably below the bounds stated in Theorems 3.3.2 and 3.3.3.
However, the bounds, with the exception noted below, are tight: for
each bound on queue size and wait time and any given values of $m_{iX}$
and $n_X$, there always exist an initiation cycle and a pipeline which
actually attain the upper bound values. It may not be possible,
however, to achieve all bounds simultaneously in a single pipeline.

The one exception to the tightness of the bounds occurs when
$\sum_X m_{iX} = 1$, that is, there is only one usage of segment $i$ in
the reservation table. In this case there can be no collisions at
segment $i$ and therefore the wait time is zero. However the bound on
the maximum wait time in (1) of Theorem 3.3.3 may have a nonzero value.
The bounds on the queue size and the expected wait time are tight
without any exception. The following theorem shows that the bounds are
tight. We have shown this only for single function pipelines for
simplicity.

Theorem 3.3.4: Given $m > 1$ and $n > 0$, there exist for each bound of
Corollaries 3.3.2.1 and 3.3.3.1, a cycle and a pipeline such that the
bound is reached in the pipeline.

Proof: We shall only sketch the proof rather than give a detailed
explanation of every step. The accompanying illustrations should help
clarify most of the assertions made in this proof.

1. We claim that the bounds on the maximum queue size and the expected
wait time are reached in the following pipeline with cycle
$(\underbrace{1, 1, \ldots, 1}_{(n-1)\ \text{1's}},\ mn-n+1)$. The
pipeline is constructed to have a row
$\{\, j \mid \text{integer } j,\ \exists \text{ integer } k,\ 0 \le k \le m-1 \text{ such that } j = kmn - \ldots \,\}$
and all other rows with one X per row. The number of these rows
is of no consequence. (See Figure 3.3.2a, for $m = 3$ and $n = 3$; the
rows with one X are not shown.)

To prove the above claim, note that n tasks are initiated
successively. They do not suffer any delay at their first step. (In
this discussion the first, second, etc. steps refer to the steps of the
segment under consideration.) The arrival for the second step on the
segment is exactly one period (= mn) later. Thus the first group of
n tasks collides with the second group of n tasks. (See Figure 3.3.2b
for the flow of tasks in the segment and its buffer.) By LIFO-global
priority the first group of n tasks has to wait for the second group of
n tasks. The first group thus suffers a total of $n^2$ time units of
delay. The first group then collides at the third step with groups 2
and 3 and it suffers a total of $2n^2$ units of delay. Similarly, it
suffers a total of $3n^2$ units of delay at the 4th step and so on. Thus
the first group of n tasks suffers a total of

$$0 + n^2 + 2n^2 + \cdots + (m-1)n^2 = \tfrac{1}{2}\, m(m-1)\, n^2$$

units of delay for a total of mn computation steps. By periodicity of
the pipeline, all other groups also suffer the same delay. Then the
expected delay at each step is

$$\tfrac{1}{2}\, m(m-1)\, n^2 / mn = \tfrac{1}{2}(m-1)n,$$

which is the same as the bound.

To see how the bound on the maximum queue size is reached,
notice that m different groups of n tasks each collide with each
other over a time interval n units long (e.g. time units 18, 19 and 20
in Figure 3.3.2b). In that interval m tasks are received at each time
unit and one task is processed by the segment at each time unit. Thus
an accumulation of (m-1) tasks occurs every time unit for a total of n
units. Thus the maximum queue size of (m-1)n is reached.

2. We claim that the bound on the maximum wait time at a step is
reached in the following pipeline with cycle (1, 1, ..., 1, mn-n+1),
where the cycle begins with (n-1) 1's. This pipeline is constructed to
have a row $\{0, n, 2n, \ldots, (m-2)n\} \cup \{mn\}$ and all other rows
have one X per row. (See Figure 3.3.3a for m = 4 and n = 3; the rows
with one X are not shown.)

The proof of the above claim is straightforward. A group of
n tasks initiated does not suffer any delay for the first (m-1) steps
on the segment. (See Figure 3.3.3b for the flow of tasks.) For the
m-th step the same group collides with the next incoming group of n
tasks. The second group uses the segment for (m-1)n time units. Thus
the task $T_0$ is delayed for (m-1)n time units. $T_0$ also waits for the
tasks $T_1$ through $T_{n-1}$ because of LIFO-global priority. Thus task $T_0$
waits for a total of (m-1)n + (n-1) = mn-1 time units at the m-th step.
This is the same as the upper bound value. Q.E.D.
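
The construction in part 2 of the proof can be checked numerically. The following is a minimal sketch, not part of the original report: the function name `simulate_lifo_global` and its arguments are hypothetical, and it assumes that LIFO-global always serves the most recently initiated task among those waiting, which is how the proof above uses the priority. Only the one segment with several X's is modelled, since all other rows have a single X and cause no delay.

```python
from collections import defaultdict

def simulate_lifo_global(columns, cycle, num_tasks):
    """Simulate one shared segment and its buffer under LIFO-global priority.

    columns   -- sorted reservation-table columns in which the segment is used
    cycle     -- initiation cycle as a list of latencies, e.g. [1, 1, 10]
    num_tasks -- number of tasks to initiate
    Returns waits[task][step]: delay of each task before each of its steps.
    """
    # Initiation times: task 0 starts at time 0, later tasks follow the cycle.
    start, starts = 0, []
    for t in range(num_tasks):
        starts.append(start)
        start += cycle[t % len(cycle)]

    arrivals = defaultdict(list)  # time -> [(task, step, arrival_time)]
    for task, s in enumerate(starts):
        arrivals[s + columns[0]].append((task, 0, s + columns[0]))

    waits = [[0] * len(columns) for _ in range(num_tasks)]
    buffer, remaining, now = [], num_tasks * len(columns), 0
    while remaining:
        buffer.extend(arrivals.pop(now, []))
        if buffer:
            # Assumed LIFO-global rule: the most recently initiated task wins.
            job = max(buffer, key=lambda j: j[0])
            buffer.remove(job)
            task, step, arrived = job
            waits[task][step] = now - arrived
            remaining -= 1
            if step + 1 < len(columns):
                # The rows in between hold one X each, so only the column
                # gap elapses before the task returns to this segment.
                nxt = now + columns[step + 1] - columns[step]
                arrivals[nxt].append((task, step + 1, nxt))
        now += 1
    return waits

m, n = 4, 3
columns = [k * n for k in range(m - 1)] + [m * n]  # {0, n, ..., (m-2)n} U {mn}
cycle = [1] * (n - 1) + [m * n - n + 1]            # (1, ..., 1, mn-n+1)
waits = simulate_lifo_global(columns, cycle, num_tasks=10 * n)
print(max(max(w) for w in waits), "vs bound", m * n - 1)
```

For m = 4 and n = 3 the longest wait observed is that of the first task of each group at its last step, which can be compared directly with the value mn-1 derived in the proof.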

We  mentioned  earlier  that  the  internal  buffers  can  be  thought 
of  as  adaptive  elemental  delays.  The  best  example  of  this  occurs  in  a 
single  function  LIFO-global  pipeline  with  a constant  latency  cycle. 
Recall  the  procedure  to  make  a pipeline  allowable  with  respect  to  a 
constant  latency  cycle  described  in  the  proof  of  Theorem  2.3.1. 

The  elemental  delays  were  inserted  to  avoid  collisions,  starting  from 
the  leftmost  column  to  the  right.  The  task  flow  in  the  pipeline 
modified  in  this  manner  is  identical  to  that  of  the  original  pipeline 
employing  LIFO-global  buffers.  This  happens  because  when  modifying 
the  table,  we  inserted  delay  elements  of  a later  computation  step,  which 
is  to  say  that  when  two  tasks  collide,  the  earlier  arriving  task  is 
always  delayed  over  the  later  arriving  task.  This  delay  insertion 
procedure  amounts  to  a LIFO-global  priority. 


3.4.  Priority  Schemes  Based  on  Processing  State 


We  mentioned  earlier  that  FIFO-global  and  LIFO-global 
priorities  were  somewhat  difficult  to  implement.  We  shall  see  that 
the  priorities  based  on  the  processing  state  of  a task  are  simpler 
to  implement.  In  all  such  schemes  we  assume  an  implementation  in 
which  one  buffer  is  provided  for  each  computation  step  rather  than  one 
at  each  segment.  A schematic  showing  the  buffers  for  this  type  of 
implementation  is  given  in  Figure  3.4.1b.  What  the  following  schemes 
describe are priorities among buffers rather than those within a
buffer.

1. MPF: Most Processed First.
2. LWRF: Least Work Remaining First.
3. LPF: Least Processed First.
4. MWRF: Most Work Remaining First.

In  a single  function  pipeline  the  most  processed  task  is 
also  the  task  with  least  work  remaining.  Thus  MPF  and  LWRF  are 
identical  for  single  function  pipelines.  Similarly  LPF  and  MWRF  are 
also  identical  for  single  function  pipelines. 

In Figure 3.4.1b if the priority used is MPF or LWRF then
the tasks in buffer 2 would have priority over the tasks in buffer 0
and similarly buffer 4 would have priority over buffer 3. If the
priority was LPF or MWRF then buffers 0 and 3 would have priority over
buffers 2 and 4 respectively.
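
The four rules amount to choosing a maximum or a minimum over the buffers of a segment. The sketch below is only an illustration under stated assumptions: the function `choose_buffer` is hypothetical, buffer k is taken to hold tasks that have completed k steps (matching the buffer numbers 0 through 4 used above), and each step is assumed to represent one unit of work in a single function pipeline.

```python
def choose_buffer(nonempty, scheme, total_steps):
    """Pick which nonempty buffer of one segment is served next.

    nonempty    -- buffer numbers competing for the segment; buffer k holds
                   tasks that have completed k steps (assumed numbering)
    scheme      -- "MPF", "LWRF", "LPF" or "MWRF"
    total_steps -- number of computation steps of the single function
    """
    completed = lambda k: k                  # steps already processed
    remaining = lambda k: total_steps - k    # steps still to be done
    rules = {
        "MPF": lambda: max(nonempty, key=completed),
        "LWRF": lambda: min(nonempty, key=remaining),
        "LPF": lambda: min(nonempty, key=completed),
        "MWRF": lambda: max(nonempty, key=remaining),
    }
    return rules[scheme]()

# Buffers 0 and 2 share one segment, buffers 3 and 4 share another.
for scheme in ("MPF", "LWRF", "LPF", "MWRF"):
    print(scheme, choose_buffer([0, 2], scheme, 5), choose_buffer([3, 4], scheme, 5))
```

With a single function type, the MPF and LWRF rows of the output coincide, as do the LPF and MWRF rows, which is the identity noted in the text.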

[Figure 3.4.1(a): Reservation table with segments S0, S1 and S2 over time columns 0 through 4.]

In a multifunction pipeline it is possible that two or more
buffers at some segment have tasks which are equally processed. For
example, buffers $1_A$ and $1_B$ in Figure 3.4.2b have tasks which have
completed 1 step. Thus MPF and LPF schemes would not define the
priority uniquely. Similarly, the buffers $2_A$ and $3_B$ in the same figure
have tasks which have an equal amount of work remaining, namely, 1 step.
Therefore LWRF and MWRF do not assign the priority uniquely either.
Whenever such ambiguity arises we shall assume that some arbitrary
decision is made on the priority among competing buffers. However,
once such a priority is assigned we assume that it remains fixed.

A question still remains regarding the priority among tasks
within a buffer. This priority is of little consequence as stated
below.

Theorem  3.4.1:  In  pipelines,  which  use  a priority  based  on  processing 

state  and  use  one  buffer  per  computation  step,  the  queue  size  at  any 
instant  in  time,  the  average  wait  time  of  a task  in  a buffer  and  the 
pipeline  throughput  are  independent  of  the  priority  used  among  tasks 
within  a buffer. 

Proof : As  far  as  the  current  queue  size  is  concerned,  it  is  of  no 

consequence  which  task  is  chosen  to  leave  the  buffer.  All  the  tasks 
within  a buffer  are  of  the  same  function  type  and  are  in  the  same 
state  of  computation.  And  because  the  priority  among  buffers  is  based 
on  the  processing  state,  any  task  leaving  the  buffer  will  have  the 
same  effect  on  the  flow  of  other  tasks.  This  in  turn  implies  that 
future  queue  sizes  are  independent  of  the  task  chosen  to  leave  the 
buffer. The average wait time of a task in a buffer is a function
of the average queue size and average output rate of the buffer, both
of which are independent of the priority within a buffer. The
throughput is a function of the average output rate of the buffer,
which again is independent of the priority within a buffer. Thus
the queue size, the average wait time and the throughput are
independent of the priority within a buffer. Q.E.D.

Also  note  that  the  above  theorem  is  valid  whether  the  input 
is  periodic  or  random.  We  should  also  stress  that  it  is  the  average 
and  not  necessarily  the  maximum  or  variance  of  the  wait  time  which  is 
independent  of  the  priority  within  a buffer. 

A complete  priority  description  will  be  denoted  by  the 
priority  among  buffers  followed  by  the  priority  within  the  buffer.  For 
example,  LPF,  FIFO-buffer,  or  MWRF,  LIFO-buffer,  etc.  In  any 
discussion  when  the  priority  within  a buffer  is  unimportant,  we  shall 
not specify it but shall simply refer to the priorities among the
buffers.

Priorities  based  on  processing  states  are  considerably 
simpler  to  implement  than  FIFO-global  or  LIFO-global  priorities.  Since 
we  have  assumed  one  buffer  per  computation  step,  it  is  seen  that  each 
buffer  has  a unique  successor  buffer  and  a unique  predecessor  buffer. 
Thus  the  control  of  the  task  flow  is  simple.  Moreover,  since  a buffer 
holds  only  tasks  of  a single  function  type  and  a single  state  of 
computation,  no  tag  is  necessary  to  identify  the  state  of  a task.  The 
priority  among  buffers  is  fixed  and  hence  no  elaborate  control 
mechanism  is  required  to  order  the  tasks.  The  priority  within  a buffer 


can  be  chosen  so  as  to  make  its  implementation  simple;  e.g.,  FIFO  or 
LIFO  priority. 

We  stated  earlier  in  Section  3.2  that  FIFO-global  and  LIFO- 
global  can  sometimes  be  implemented  in  a simpler  way  than  the  one 
described.  In  the  following  we  show  that  in  some  cases  FIFO-global  and 
LIFO-global  are  equivalent  to  some  of  the  priorities  described  in  this 
section. 

Theorem 3.4.2: The flows of tasks in a single function pipeline are
identical under the following priorities.

1. FIFO-global. 2. MPF, FIFO-buffer. 3. LWRF, FIFO-buffer.

Proof: MPF and LWRF are identical in a single function pipeline.
Therefore the task-flows are identical under 2 and 3. Next we show
that the task-flows are also identical under 1 and 2.

Consider any two tasks $T_i$ and $T_j$ (j > i). With FIFO-global
priority $T_i$ has a priority over $T_j$ everywhere in the pipeline. With
MPF, FIFO-buffer, since all buffers are first-in-first-out and $T_i$
initially arrives before $T_j$, $T_i$ always is either equally processed or
more processed than $T_j$ and $T_i$ arrives before $T_j$ at each step. Thus the
task $T_i$ has priority over $T_j$ everywhere in the pipeline. Therefore
the two priority schemes result in the same flow of tasks. Q.E.D.

Theorem  3.4.3:  The  flows  of  tasks  in  a single  function  pipeline 

employing  a constant  latency  initiation  cycle  are  identical  under  the 
following  priorities. 

1.  LIFO-global  2.  LPF,  *-buffer  3.  MWRF,  *-buffer 


where  * stands  for  any  arbitrary  priority. 


Proof: Let us consider what happens under LIFO-global priority. By
Theorem 3.3.1, tasks $T_i$ and $T_{i+n}$ have identical flow patterns under
LIFO-global. For a constant latency cycle, n is 1 and therefore every
task has the same flow pattern. Thus in LIFO-global for any two
tasks $T_i$ and $T_j$, j > i, which are in different states of computation,
$T_j$ is less processed than $T_i$. Suppose these tasks are competing for
the same segment; then under LIFO-global $T_j$ has priority over $T_i$
because j > i.

By LPF $T_j$ also has priority over $T_i$ because $T_j$ is less
processed than $T_i$. Thus we can substitute LPF for LIFO-global whenever
two or more tasks in different states of computation are competing for
a segment.

Now by Lemma 3.3.2.2, under LIFO-global no two equally
processed tasks $T_i$ and $T_{i+jn}$, j > 0, can be in the same buffer, and n = 1
implies that no two equally processed tasks will ever compete for the
same segment. If we provide a separate buffer for each computation
step, there will never be more than one task in any buffer and hence
no priority is required within a buffer. Thus we have transformed the
LIFO-global discipline into an LPF, *-buffer implementation without
affecting the task-flow.

In a single function pipeline LPF is identical to MWRF and
therefore the task-flow is also the same under MWRF, *-buffer.
Q.E.D.

In the next section we will see that LIFO-global, LPF and MWRF
are similar, but not identical, under periodic but nonconstant
initiation. We have seen in Theorem 3.4.2 that FIFO-global, MPF and
LWRF are also similar, but we are not able to make any concrete comments
regarding any one of them. Some bounds are presented for a priority
scheme called Precedence-Implied Priority (PIP) in the following
section, where LPF and MWRF are shown to be special cases of that
particular priority scheme.


3.5.  Bounds  on  Queue  Size  and  Wait  Time  with  PIP 

Precedence-Implied Priority (PIP) is introduced in this
section. When LPF and MWRF priorities are applied to multifunction
pipelines there is ambiguity in the priority scheme. PIP is a class
of priorities which includes all priorities with consistent resolution
of this ambiguity. Most of the theorems to follow are stated in terms
of PIP and are thus directly applicable to LPF, MWRF and their
extensions.

In  a Precedence-Implied  Priority  (PIP)  scheme,  for  any  two 
computation  steps  x and  y,  step  x precedes  step  y implies  that  step  x 
of  one  task  has  priority  over  step  y of  another  task.  Moreover,  we 
consider  only  "consistent"  priority  schemes  in  the  sense  that  they 
must  be  transitive,  that  is,  step  x has  priority  over  step  y and 
step  y over  step  z implies  that  step  x has  priority  over  step  z.  It 
is understood that any priority scheme must be complete in the sense
that the priority between any two steps that use the same segment is
defined.

Note that sometimes two steps performed on two distinct
segments may have a priority relation by the above definition, even
though these two steps can never collide with each other. However,
the priority between such steps should not be ignored since it implies
other meaningful priorities through transitivity.

Note  also  that  the  precedence  relation  is  not  always  sufficient 
to  define  the  priority  among  all  steps,  since  there  does  not  exist 
any  precedence  relation  between  two  steps  of  two  distinct  function 
types.  Thus  one  still  has  considerable  freedom  in  assigning  the 
priorities  required  for  completeness  after  the  precedence  implied 
priorities  have  been  assigned.  Therefore,  PIP  actually  represents  a 
class  of  priorities. 

In  a single  function  pipeline  all  the  steps  are  related  by 
precedence  and  therefore  PIP  defines  a unique  priority,  which  is 
clearly  identical  to  LPF  and  MWRF  priorities.  In  a multifunction 
pipeline  PIP  still  assigns  a LPF  priority  on  each  individual  function 
but  leaves  the  priority  among  steps  of  distinct  functions  unassigned. 

To  determine  whether  a priority  scheme  is  of  type  PIP  we  can 
look  at  the  graph  of  the  precedence  and  priority  relation.  The  graph 
consists  of  a set  of  nodes  and  directed  edges.  Each  node  represents  a 
distinct  computation  step  in  the  pipeline.  A directed  edge  exists 
from  a node  x to  node  y iff  either  step  x precedes  step  y or  step  x 
has  a priority  over  step  y.  The  priority  is  of  type  PIP  if  and  only 
if  the  graph  is  acyclic  (i.e.,  does  not  have  a closed  directed  path). 

A PIP graph is acyclic since the relation on the graph, namely, x related
to y (x < y) iff x precedes y or x has a priority over y, is transitive
($x < y,\ y < z \Rightarrow x < z$), irreflexive ($x \nless x$) and hence asymmetric
($x < y \Rightarrow y \nless x$). This relation resembles a partial ordering and many
well-known authors have loosely referred to it as a partial ordering
(e.g. [KNU73] pp. 258-259). For details on relations and their graphs
see any book on modern algebra (e.g. [STO73]). For readability, edges
may be deleted from a precedence-priority graph if they would be
implied by transitivity.
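
The acyclicity test can be carried out mechanically. The following sketch is an illustration only: the function `is_pip`, the step labels and the example edges are hypothetical and do not correspond to any figure of this report. It repeatedly removes nodes without incoming edges; the graph is acyclic exactly when every node can be removed in this way.

```python
from collections import defaultdict, deque

def is_pip(steps, precedence, priority):
    """Return True if the precedence-priority graph is acyclic.

    steps      -- iterable of step labels (the nodes)
    precedence -- pairs (x, y) meaning step x precedes step y
    priority   -- pairs (x, y) meaning step x has priority over step y
    """
    steps = list(steps)
    succ = defaultdict(set)
    indeg = {s: 0 for s in steps}
    for x, y in set(precedence) | set(priority):
        succ[x].add(y)
        indeg[y] += 1
    # Repeatedly strip nodes that have no remaining incoming edge.
    ready = deque(s for s in steps if indeg[s] == 0)
    removed = 0
    while ready:
        x = ready.popleft()
        removed += 1
        for y in succ[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                ready.append(y)
    return removed == len(steps)

# Hypothetical two-function pipeline with steps A1 -> A2 and B1 -> B2.
steps = ["A1", "A2", "B1", "B2"]
precedence = [("A1", "A2"), ("B1", "B2")]
print(is_pip(steps, precedence, [("A1", "B2"), ("B1", "A2")]))  # consistent
print(is_pip(steps, precedence, [("A2", "B1"), ("B2", "A1")]))  # closed path
```

In the second call the priority edges close the path A1, A2, B1, B2, A1, so the scheme is rejected. A priority of a later step over an earlier step of the same function, such as ("A2", "A1"), is rejected for the same reason.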

As  an  example,  take  the  pipeline  of  Figure  3.5.1a.  Figure 
3.5.1b  is  the  precedence  graph  of  that  pipeline.  The  nodes  in  the  graph 
are  labeled  with  the  number  and  type  of  the  computation  step  as  well  as 
the  name  of  the  segment  on  which  the  step  is  performed.  As  an  example, 
a precedence-priority  graph  of  a priority  of  type  PIP  on  the  same 
pipeline  is  given  in  Figure  3.5.1c. 

We  already  mentioned  the  similarity  of  LPF  and  MWRF 
priorities  to  PIP.  With  the  help  of  precedence-priority  graphs  we 
show  below  that  LPF  and  MWRF  priorities  are  indeed  of  type  PIP. 

To see why LPF is of type PIP consider the pipeline of
Figure 3.5.1a. Group together the computation steps of each function
type according to their numbers, that is, all first steps in one group,
all second steps in another group and so on. This grouping is indicated
by large ovals in Figure 3.5.2a. If these groups are laid out
according to the step numbers in an increasing order (as is done in
Figure 3.5.2a), the edges representing the precedence will always be
directed from left to right. In LPF, lower numbered steps have priority
over higher numbered steps and therefore the edges representing the
priority between distinctly numbered steps will also be directed from
left to right. Thus the precedence and priority edges between the
groups of steps cannot form a closed path. For steps within a group
the only edges, if any, are priority edges (e.g. steps numbered 2 in
Figure 3.5.2a). These edges cannot form a cycle, because if they did,
the priority among these steps would be ambiguous and would not represent
a valid priority scheme. Thus we have shown that the precedence-
priority graph of a pipeline with LPF priority is acyclic. Therefore
the LPF priority is of type PIP.

[Figure 3.5.1. Precedence-priority graph of a multifunction pipeline. (c) Precedence-priority graph.]

[Figure 3.5.2. Precedence-priority graphs with LPF and MWRF priorities. (a) LPF priority. (b) MWRF priority.]

For  MWRF  priority  we  group  the  steps  which  are  equally  far 
from  the  last  steps  of  their  respective  function  types.  Such  a grouping 
is  shown  in  Figure  3.5.2b  for  the  same  pipeline  as  above  (Figure  3.5.1a). 
Using  exactly  the  same  arguments  as  those  for  LPF,  we  conclude  that  the 
precedence-priority  graph  is  acyclic.  Thus  MWRF  priority  is  also  of 
type  PIP. 

To simplify the following discussion of PIP, we shall assign
an integer called the precedence-priority-number (ppn) to each node in the
precedence-priority graph. The assignment of ppn's is done as follows.
It is assumed that the priority in the graph is of type PIP.

Algorithm 3.5.1:

P1. i ← 1;

P2. Collect all the nodes in the graph which do not have any
predecessors; assign each of these nodes a ppn value equal to i;

P3. Remove the above nodes (i.e. nodes with ppn = i) and their outgoing
arcs from the graph;

P4. i ← i+1; if no more nodes are left in the graph then terminate else go
to P2. □
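
Steps P1 through P4 can be sketched in a few lines. The function `assign_ppn` and the example graph below are hypothetical illustrations, not taken from the figures of this report; the edge list is assumed to contain both precedence and priority edges.

```python
def assign_ppn(steps, edges):
    """Assign precedence-priority-numbers by peeling off source nodes.

    steps -- iterable of node labels
    edges -- pairs (x, y): x precedes y, or x has priority over y
    Returns a dict mapping each node to its ppn (1 = highest priority).
    """
    preds = {s: set() for s in steps}
    for x, y in edges:
        preds[y].add(x)

    ppn, i = {}, 1                       # P1: i <- 1
    remaining = set(preds)
    while remaining:
        # P2: nodes with no predecessor left in the current graph.
        layer = {s for s in remaining if not (preds[s] & remaining)}
        if not layer:
            raise ValueError("graph has a cycle; priority is not of type PIP")
        for s in layer:
            ppn[s] = i
        remaining -= layer               # P3: remove them and their arcs
        i += 1                           # P4: i <- i + 1
    return ppn

edges = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("A2", "B2")]
print(assign_ppn(["A1", "A2", "A3", "B1", "B2"], edges))
```

In this example A1 and B1 receive ppn 1, A2 receives ppn 2, and A3 and B2 receive ppn 3. The explicit cycle check is an addition of the sketch; the algorithm as stated simply assumes that the priority is of type PIP.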

For  the  graphs  of  Figures  3.5.1c,  3.5.2a  and  3.5.2b  the 
assignment  of  ppn's  is  shown  in  Figures  3.5.3a,  b,  and  c,  respectively. 

The ppn's assigned in the above manner satisfy the following
properties.

Steps  having  the  same  ppn  are  unrelated  by  precedence  as  well 
as  priority.  Nodes  with  lower  ppn  values  have  higher  priority.  The 
flow  of  tasks  in  the  pipeline  is  always  from  steps  with  lower  ppn 
towards  steps  with  higher  ppn.  These  properties  allow  us  to  prove  the 
following  important  theorem. 

Theorem  3.5.1:  In  a buffered  pipeline  with  PIP,  the  usage  pattern  of 

segments  eventually  (in  finite  time)  becomes  periodic  with  the  period 
of  the  initiation  cycle. 

Proof : Let  the  computation  steps  of  the  pipeline  be  assigned  the  ppn 

values  according  to  Algorithm  3.5.1.  We  shall  prove  the  theorem  by 
induction  on  ppn. 

Basis step: Take the steps with ppn = 1. These steps have the highest
precedence. The tasks arrive at these steps with the period p, the
period of the initiation cycle. Furthermore, these steps have the
highest priority and therefore the execution of these steps occurs with
the same period as the initiation cycle, namely, p.




Induction step: Assume that the execution of steps with ppn < i is
periodic with period p. We show that the execution of steps with ppn = i
eventually becomes periodic with period p.

The arrivals at the steps with ppn = i are periodic with
period p, because by hypothesis, all the predecessor steps (i.e.,
steps with ppn < i) are being executed with a period p. The available
execution opportunities for the steps with ppn = i are also periodic
with period p for the same reasons. Furthermore, the activities of the
steps with ppn > i do not affect the execution of the steps with ppn = i.
Therefore the execution of steps with ppn = i eventually becomes periodic
with period p.

Thus the execution of all steps eventually becomes periodic
with period p. Q.E.D.

We  shall  say  that  a pipeline  is  in  the  steady  state  when 
all  the  computation  steps  are  being  executed  periodically.  The 
following  corollary  is  merely  an  interpretation  of  the  usage  pattern 
being  periodic  with  a period  p.  Recall  here  that  an  implementation  of 
one  buffer  per  computation  step  is  assumed  throughout  this  section. 
Corollary 3.5.1.1: In a buffered pipeline with PIP the following holds
during the steady state.

1. A task $T_i$ arrives at some buffer at time t ⇒ $T_{i+n}$ arrives at the
same buffer at time t+p, for all i ≥ 0.

2. A task $T_i$ is processed by some segment during interval (t,t+1) for
its k-th computation step ⇒ $T_{i+n}$ is processed by the same segment
during (t+p,t+p+1) for its k-th step, for all i ≥ 0. □




The  following  lemmas  and  theorems  lead  to  the  establishment 
of  bounds  on  queue  size  and  wait  times  during  the  steady  state  of  the 
pipeline.  In  all  of  the  following,  "a  buffer  of  type  X"  means  "a  buffer 
at  some  computation  step  of  a task  of  type  X." 

Lemma 3.5.2.1: During the steady state of a buffered pipeline with PIP,
no more than $n_X$ tasks of type X arrive at and no more than $n_X$ tasks of
type X depart from a buffer of type X, during any time interval
(t,t+p).

Proof: Consider the buffer of the first step of function X. Exactly
$n_X$ tasks arrive at this buffer during any interval (t,t+p). In the
steady state, from Corollary 3.5.1.1, if more than $n_X$ tasks leave the
buffer during some interval (t',t'+p) then more than $n_X$ tasks must
leave the same buffer during (t'+p,t'+2p), (t'+2p,t'+3p), ... and so
on. However, this departure rate cannot go on indefinitely because
more tasks would leave the buffer than would arrive. Since the buffer
can only contain a finite number of tasks to start with, the steady
state assumption would be violated. Thus no more than $n_X$ tasks arrive
at or leave the first buffer during any interval (t,t+p) in steady
state.

Whatever leaves the first buffer goes to the second buffer.
Thus the second buffer receives no more than $n_X$ tasks during any
interval (t,t+p). Similarly, no more than $n_X$ tasks leave the second
buffer during any interval (t,t+p). By finite induction, the lemma is
proved. Q.E.D.




When  the  above  lemma  is  applied  simultaneously  to  all  the 
buffers  of  a segment  the  following  lemma  results. 

Lemma 3.5.2.2: During the steady state of a buffered pipeline with PIP,
the segment i receives no more than $\sum_X m_{iX} n_X$ tasks and it processes no
more than the same number of tasks during any interval (t,t+p). □

Lemma 3.5.2.3: During the steady state of a pipeline with PIP and a
workload < 100%, every buffer at segment i is empty for at least one
unit time interval during any interval $(t,\ t+\sum_X m_{iX} n_X)$.

Proof: If a buffer associated with segment i remains nonempty for
$\sum_X m_{iX} n_X$ consecutive units then segment i is busy for at least
$\sum_X m_{iX} n_X + 1$ consecutive time units. The workload being < 100% implies
that $\sum_X m_{iX} n_X < p$, i.e., $\sum_X m_{iX} n_X + 1 \le p$. Thus segment i remains busy
for $\sum_X m_{iX} n_X + 1$ units during some interval (t,t+p). This statement
contradicts Lemma 3.5.2.2. Hence, the buffer must remain empty for at
least one time unit during any interval $(t,\ t+\sum_X m_{iX} n_X)$. Q.E.D.

The above lemma is our assurance that irrespective of the
priority within a buffer, all tasks will be processed when workload
is < 100%. Thus for example we may use a LIFO-buffer without the
fear that a task may remain in some buffer indefinitely. However,
notice that we have assumed a workload strictly < 100%, a deviation from
our standard assumption of workload ≤ 100%. Unfortunately these
results are not fully applicable with 100% workload as discussed
later in this section.




Theorem 3.5.2: During the steady state of a buffered pipeline with
PIP and a workload < 100%, the number of tasks in any buffer of type X
is always ≤ $n_X$.

Proof: Consider a particular buffer of type X. By Lemma 3.5.2.3 the
buffer is empty for at least one time unit during any interval
$(t,\ t+\sum_X m_{iX} n_X)$. Suppose this unit interval during which the buffer is
empty is (t',t'+1). Now due to the steady state, the buffer is also
empty during (t'+p,t'+p+1). Moreover by Lemma 3.5.2.2 no more than
$n_X$ tasks can arrive during the interval (t',t'+p). Thus no more than
$n_X$ tasks can accumulate in the buffer. Q.E.D.

Theorem 3.5.3: During the steady state of a buffered pipeline with PIP
and a workload < 100%, the following bounds on wait times hold:

1. The maximum wait time of a task for a computation step at segment i
is

$$\le \sum_X m_{iX} n_X - 1.$$

2. The expected wait time of a task for a computation step at segment i
is

$$\le \frac{1}{2} \sum_X m_{iX} n_X - \frac{\sum_X m_{iX} n_X^2}{2 \sum_X m_{iX} n_X}.$$

Proof: 1. Directly follows from Lemma 3.5.2.3.

2. Once we establish that all the buffers at segment i remain empty
simultaneously for one time unit during some interval $(t,\ t+\sum_X m_{iX} n_X)$,
the rest of the proof is identical to the proof of (2) of Theorem 3.3.3.

Consider the lowest priority buffer at segment i. When
this buffer becomes empty due to selection of the last task in the
buffer for execution, it is implied that all higher priority buffers
must be empty. Thus all buffers become empty simultaneously at least
once during any period (t,t+p). From this instant on all the arguments
in the proof of Theorem 3.3.3 apply and therefore we omit here the
rest of the proof. Q.E.D.
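
As a numerical aid, the two wait-time bounds can be evaluated for a given segment. The sketch below is hypothetical and rests on an assumption: it takes the bounds to be $\sum_X m_{iX} n_X - 1$ and $\frac{1}{2}\sum_X m_{iX} n_X - \sum_X m_{iX} n_X^2 / (2\sum_X m_{iX} n_X)$, a reading chosen so that the single function case reduces to mn-1 and (m-1)n/2, the values reached in the proof of Theorem 3.3.4. The function name `pip_wait_bounds` and the dictionary arguments are illustrative only.

```python
from fractions import Fraction

def pip_wait_bounds(m_i, n):
    """Evaluate the assumed wait-time bounds at one segment i.

    m_i -- dict: function type X -> number of uses of segment i by one task
    n   -- dict: function type X -> tasks of type X initiated per period
    Returns (bound on maximum wait, bound on expected wait).
    """
    load = sum(m_i[x] * n[x] for x in m_i)            # sum of m_iX * n_X
    squares = sum(m_i[x] * n[x] ** 2 for x in m_i)    # sum of m_iX * n_X^2
    max_wait = load - 1
    expected = Fraction(load, 2) - Fraction(squares, 2 * load)
    return max_wait, expected

# Single function, m = 3 uses of the segment, n = 3 tasks per period.
print(pip_wait_bounds({"X": 3}, {"X": 3}))
# Two function types sharing the segment (made-up numbers).
print(pip_wait_bounds({"A": 2, "B": 1}, {"A": 1, "B": 2}))
```

Exact fractions are used so that values such as 3/2 are not blurred by floating-point rounding.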

As  stated  earlier  in  Theorem  3.4.1,  the  bounds  on  the  queue 
length  and  the  expected  wait  time  are  independent  of  the  priority  used 
within  a buffer.  However,  the  bound  on  the  maximum  wait  time  does 
depend  on  this  priority.  In  our  proof  of  (1)  of  Theorem  3.5.3,  we 
did  not  use  any  specific  priority  within  a buffer.  A tighter  bound 
may  be  obtained  for  some  specific  priority. 

Take for example FIFO-buffer. Let the lowest priority buffer
for the segment i be of type Y. Clearly, the worst case delay can
occur only for the tasks in the lowest priority buffer. Suppose a task
arrives at this buffer when it is empty. Since the buffer is first-in-
first-out, this task can only be delayed by the tasks from other
buffers. During every period (t,t+p) a total of $\sum_X m_{iX} n_X$ steps are
executed on segment i; of these, $n_Y$ tasks are taken from the lowest
priority buffer. Therefore the task under consideration can be delayed
for at most $\sum_X m_{iX} n_X - n_Y$ time units. It can be shown that even if a
task arrived at the buffer when the buffer was not empty, the task
cannot be delayed more than $\sum_X m_{iX} n_X - n_Y$ time units.




For LIFO-buffer the bound can be shown to be the same as in
(1) of Theorem 3.5.3. For other priorities within the buffer the
bound would lie between the bounds for FIFO-buffer and LIFO-buffer.

All the bounds shown so far are proven under two restrictions.
One is that the workload be strictly less than 100%. If the workload
is assumed to be equal to 100% then the arguments in the proof of
Lemma 3.5.2.3 are no longer applicable to the buffers with the least
priority on each segment i for which $\sum_X m_{iX} n_X = p$. However, the result
of that lemma is still valid for other buffers and hence the bounds of
Theorems 3.5.2 and 3.5.3 are also valid for these buffers.

Another restriction that we imposed in the above theorems is
that of steady state. We have put considerable effort into investigating
the transients. As yet, we do not have any concrete results to report.
However, we have gained considerable insight into the problem and
this insight has led us to believe the following two conjectures.
Conjecture  3.5.1:  A buffered  pipeline  with  PIP  and  workload  < 100% 

reaches  a steady  state  within  the  number  of  initiation  cycle  periods 
equal  to  the  value  of  the  largest  ppn  in  the  precedence-priority  graph 
of  the  pipeline.  □ 

Conjecture  3.5.2:  In  a buffered  pipeline  with  PIP,  the  bounds  of 

Theorems  3.5.2  and  3.5.3  are  valid  during  the  transient  and  steady 
state  and  for  all  workloads  < 100%.  □ 

The  above  conjecture  can  be  shown  to  be  true  in  a very  special 
case  namely,  single  function  pipelines  with  constant  latency  cycles.  In 
a single  function  pipeline  PIP  is  the  same  as  LPF  and  MWRF.  Therefore 




by  Theorem  3.4.3  the  task  flows  are  identical  with  LIFO-global  and  PIP 
in  this  special  case.  Thus  the  results  concerning  LIFO-global 
priority  directly  apply  to  PIP  as  stated  below. 

Theorem 3.5.4: In a buffered single function pipeline with PIP and a
constant latency cycle with workload < 100% the following bounds are
satisfied.

1. Maximum number of tasks in a buffer is $\le 1$.

2. Maximum wait time of a task for a computation step at segment i is
$\le (m_i - 1)$.

3. The expected wait time of a task for a computation step at segment
i is $\le \frac{1}{2}(m_i - 1)$.

Proof: 1. Follows from the proof of Theorem 3.4.3, namely, that no two
tasks can be together in a buffer at any instant.

2. By Theorem 3.4.3 the bound in (1) of Corollary 3.3.3.1
applies, namely, $(m_i n - 1)$. For constant latency cycles n is 1 and
therefore the bound is $(m_i - 1)$.

3. For the same reasons as above the bound (2) of Corollary
3.3.3.1 applies, which at n=1 is $\frac{1}{2}(m_i - 1)$. Q.E.D.

Regarding  the  tightness  of  the  bounds  of  Theorems  3.5.2  and 
3.5.3,  the  same  comments  as  on  the  bounds  of  LIFO-global  priority 
apply.  Specifically,  the  bounds  are  the  best  possible  with  the 
information  used,  but  in  general,  the  actual  maximum  queue  size  and 
wait  time  may  be  considerably  less  than  the  stated  bounds. 




3.6. Concluding Remarks

The  principal  result  of  this  chapter  is  that  one  can  always 
obtain  the  maximum  throughput  of  a pipeline  with  the  use  of  internal 
buffers.  Moreover,  one  has  complete  freedom  in  choosing  an  initiation 
cycle.  We  have  shown  that  any  initiation  cycle  with  a workload  < 100% 
can  be  processed  by  a pipeline  with  internal  buffers  using  either  LIFO- 
global  priority  or  any  other  priority  of  type  PIP.  Two  priorities  of 
interest,  namely,  LPF  and  MWRF  were  shown  to  be  of  type  PIP. 

We also presented the bounds on queue size and wait time for
LIFO-global and PIP. For PIP, however, we were able to prove these
bounds only under the assumptions that the workload was < 100% and that
the pipeline was in steady state. The bounds were found to be the best
possible  with  the  information  used.  Some  examples  tried  by  us  have 
shown  that  the  actual  queue  size  and  wait  time  are  likely  to  be 
considerably  less  than  the  stated  upper  bounds.  For  a particular 
pipeline  we  suggest  simulation  to  determine  the  actual  bounds.  At 
least  for  LIFO-global  and  PIP  we  know  by  experience  that  the  simulation 
is  required  only  for  a short  time  because  the  steady  state  is  reached 
in  a short  time.  In  fact,  in  most  cases  hand  computation  is  sufficient 
to  trace  out  the  task  flows  until  the  flow  becomes  periodic. 

None  of  the  above  comments  are  applicable  to  FIFO-global, 

MPF  and  LWRF  priorities.  Nevertheless,  we  have  not  yet  found  any 
example  in  which  the  steady  state  was  not  reached.  Although  not  yet 
proven,  we  believe  that  the  queues  are  bounded  and  that  the  maximum 
throughput  is  possible  with  any  cycle. 


For  guidance  in  choosing  a cycle  with  any  of  the  above 
priorities,  the  same  comments  can  be  made  as  in  Section  2.11.  Namely, 
that  a cycle  with  small  period  and  evenly  spaced  task  initiations  is 
likely  to  produce  small  queue  sizes  and  small  wait  times. 

In  this  chapter  we  did  not  discuss  how  fast  one  can  switch 
from one cycle to another. As far as PIP, FIFO-buffer is concerned,
the problem is not serious except that we cannot predict the queue
bounds and wait times precisely. In LIFO-global, however, the possibility
exists that some tasks from the previous cycle may never get processed.
Therefore we suggest that any such switching must be done with care, and
a simulation should be carried out to predict any such unusual behavior.
Of course, one can always use the conservative approach and let the
pipeline become empty before another cycle is started. This practice
can result in a degradation of throughput if the switching of cycles is
frequent.

Results  of  this  chapter  are  applicable  only  in  a well-defined 
environment,  such  as  in  a vector  machine  where  vector  instructions 
can  be  conveniently  converted  into  initiation  cycles.  When  the 
switching  of  cycles  is  frequent  or  when  the  arrivals  are  random,  or 
unpredictable,  the  buffers  are  still  useful  in  resolving  the  collisions. 
The  next  chapter  deals  with  buffered  pipelines  with  random  arrivals. 




Chapter  4 

INTERNAL BUFFERS WITH RANDOM ARRIVALS

4.1.  Introduction 

In  this  chapter,  we  investigate  pipelines  with  internal 
buffers and random rather than periodic initiations. In Chapter 1 we
indicated the need for processing random sequences of tasks in the order
of their arrival. The arithmetic units of the existing pipeline computers
(e.g.  TI-ASC  and  CDC-STAR)  use  a very  conservative  approach  to  this 
problem.  In  these  machines  only  those  instructions  are  overlapped  which 
are  of  the  same  type,  e.g.,  a sequence  of  multiplies.  A sequence  of 
distinct  instructions  is  executed  without  any  overlap,  that  is,  one 
instruction  at  a time.  Since  a program  of  general  nature  normally  does 
not  contain  many  sequences  of  consecutive  instructions  of  the  same  type, 
these  machines  produce  a throughput  considerably  below  the  possible 
maximum throughput. Even "intelligent" compilers for such machines
which try to compile programs into "vector instructions" (i.e., sequences
of operations of the same type) find substantial portions of code
non-vectorizable (i.e., "scalar"). The only other proposed control strategy
for scalar instruction overlap is the greedy strategy of Davidson
[DAV71] and Thomas [THO74], which even though superior to the existing
strategies,  still  produces  a throughput  well  below  the  possible  maximum 
throughput,  as  we  shall  see  later  in  this  chapter.  Here  we  propose  yet 
another alternative, namely, the use of internal buffers.

We shall experimentally evaluate the effectiveness of internal
buffers  and  greedy  control  in  this  chapter.  We  shall  investigate  by 
simulation  only  those  priority  schemes  which  are  based  on  the  processing 
state  because  of  their  simpler  implementation  than  FIFO-global  and 
LIFO-global  as  was  described  in  Section  3.4.  Moreover,  FIFO-global  is 
very  similar  to  MPF  and  LWRF  as  indicated  by  Theorem  3.4.1.  LIFO- 
global  cannot  be  used,  because  there  exists  a possibility  that  some  tasks 
may  never  get  processed  or  they  may  be  forced  to  wait  for  an  unusually 
long  time.  For  the  same  reasons  LIFO-buffers  are  ruled  out.  The 
buffering  scheme  is  assumed  to  be  one  FIFO  buffer  per  computation  step. 

The effect of the following parameters on performance will be
investigated:

1.  Buffer  size 

2.  Workload 

3.  Priority  scheme. 

All  buffers  except  the  input  buffers  are  assumed  to  have  the 
same  size.  The  input  buffer  size  is  assumed  to  be  unbounded.  Some 
modification  of  the  priority  schemes  is  necessary  because  of  the 
finiteness of the buffers inside the pipeline. Before we describe the
modification, recall that the equivalent of a three-step action takes
place at every time instant (Section 3.2). First, all tasks leave the
segments  and  arrive  in  their  next  buffers;  second,  the  priority  control 
selects  the  tasks  with  highest  priorities  and  third,  the  selected  tasks 
are  forwarded  to  the  appropriate  segments.  The  modification  imposed  is 
that  the  priority  control  considers  only  those  buffers  whose  successor 


buffers  are  not  full  at  the  current  time  instant.  Thus  a task  will 
not  leave  a buffer  if  its  successor  buffer  is  full  even  if  the  segment 
may  remain  idle.  This  modification  guarantees  that  a segment  can  always 
deliver  a task  to  the  next  buffer  without  the  possibility  of  an  overflow. 
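
The three-step action with the modified selection rule can be sketched in Python as follows. This is only an illustrative sketch, not the simulator used for the experiments: the function `time_step`, the dictionary layout of `buffers` and `in_service`, the `capacity` parameter and the `mpf_key` priority key are all hypothetical names introduced here. It assumes one FIFO buffer per computation step, identified by (function type, step index), with step 0 acting as the unbounded input buffer.

```python
from collections import deque

def time_step(flows, buffers, in_service, priority, capacity):
    """Advance a buffered pipeline by one time instant (illustrative sketch).

    flows:      function type -> list of segment numbers, one per step
    buffers:    (function type, step index) -> deque of waiting tasks
    in_service: segment number -> (task, function type, step index)
    priority:   callable (function type, step index) -> sortable key,
                smaller key means higher priority (fixed over time)
    capacity:   size of every internal buffer; input buffers (step 0)
                are never a successor, so they are effectively unbounded
    Returns the list of tasks that terminated at this instant.
    """
    finished = []

    # Action 1: every task leaves its segment and arrives in the buffer
    # of its next computation step, or terminates after its last step.
    leaving = list(in_service.values())
    in_service.clear()
    for task, f, k in leaving:
        if k + 1 < len(flows[f]):
            buffers[(f, k + 1)].append(task)
        else:
            finished.append(task)

    def successor_full(f, k):
        # The last step has no successor buffer, so it is never blocked.
        if k + 1 >= len(flows[f]):
            return False
        return len(buffers[(f, k + 1)]) >= capacity

    # Action 2: for each segment, pick the highest-priority non-empty
    # buffer whose successor buffer is not full at this instant.
    chosen = {}
    for (f, k), queue in buffers.items():
        if not queue or successor_full(f, k):
            continue
        seg = flows[f][k]
        if seg not in chosen or priority(f, k) < priority(*chosen[seg]):
            chosen[seg] = (f, k)

    # Action 3: forward the selected tasks to their segments.
    for seg, (f, k) in chosen.items():
        in_service[seg] = (buffers[(f, k)].popleft(), f, k)
    return finished

def mpf_key(f, k):
    """Assumed MPF*-style key: more processed steps first, ties by type."""
    return (-k, f)
```

With a single function type whose flow is segment 1 followed by segment 2 and internal buffers of size 1, the second task is held in the input buffer for one instant although segment 1 is idle, which is exactly the behavior the modification trades for freedom from overflow.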

Note  that  the  priority  control  considers  only  those  buffers 
whose  successor  buffers  are  not  full  at  the  current  time  instant.  It 
is  possible  to  avoid  an  overflow  by  considering  only  those  buffers 
whose  successor  buffers  are  not  going  to  be  full  at  the  next  time 
instant.  However,  the  latter  implementation  is  far  more  complicated 
than  the  previous  one  since  the  decision  process  has  to  occur  at  a 
global  level  rather  than  at  local  level.  Moreover,  this  decision 
process  is  sequential  rather  than  parallel  since  the  selection  of  a 
task  from  a buffer  might  cause  a selection  of  another  task  from  its 
predecessor  buffer  and  a task  from  pre-predecessor  buffer  and  so  on. 

We  shall  adhere  to  the  modification  suggested  originally. 

The  following  priority  schemes  will  be  studied.  A * indicates  that  the 
priority  as  explained  in  Section  3.4  is  modified  with  respect  to 
successor  buffers  in  the  manner  described  above: 


1. MPF*: Modified Most Processed First

2. LWRF*: Modified Least Work Remaining First

3. LPF*: Modified Least Processed First

4. MWRF*: Modified Most Work Remaining First

5. RANDOM*: Modified Random.

In  RANDOM,  the  priority  is  randomly  assigned  among  buffers. 
As with all the other priorities we have considered, we assume in RANDOM
that once the priority is assigned, it remains fixed. RANDOM serves
as  a basis  of  comparison  for  the  performance  of  other  priority  schemes. 

The  following  assumptions  are  made  regarding  task  arrivals: 

1.  At  most  one  task  arrives  at  each  buffer  in  a unit  time  interval. 

2. Arrivals at the pipeline are random according to a binomial
distribution and independent of the processing rate.

3.  The  mean  arrival  rate  of  tasks  of  each  function  type  is  fixed. 
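
One way to realize these arrival assumptions in a simulation is sketched below. It is an assumption of this sketch, not a statement of the report's method, that each function type is treated as an independent Bernoulli trial per unit time, so that the number of arrivals of a type over any interval is binomially distributed with a fixed mean rate. The name `generate_arrivals` and its parameters are hypothetical.

```python
import random

def generate_arrivals(rates, duration, seed=0):
    """Generate (time, function type) arrival events (illustrative sketch).

    rates:    function type -> probability of one arrival per unit time
              (this is also the mean arrival rate of that type)
    duration: number of unit time intervals to generate
    seed:     seed for the private random generator, for repeatability
    At most one task of each type arrives per unit time, and arrivals do
    not depend on the state of the pipeline.
    """
    rng = random.Random(seed)
    arrivals = []
    for t in range(duration):
        for ftype, p in rates.items():
            # One Bernoulli trial per function type per time unit.
            if rng.random() < p:
                arrivals.append((t, ftype))
    return arrivals
```

Because the generator is seeded, the same arrival sequence can be replayed against each priority scheme, which keeps comparisons between schemes free of sampling differences.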

The performance will be judged based on the following
parameters:

1. Segment utilization.

2. Expected average wait time of a task from arrival to termination,
$w_1$.

$w_1$ = (time of termination) - (time of arrival) - (time to
execute a task without buffers).

3. Expected average wait time of a task from initiation in the
pipeline to termination, $w_2$.

$w_2$ = (time of termination) - (time of initiation) - (time to
execute a task without buffers).
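
The two wait times can be computed from per-task timestamps as in the following sketch. The function names and the record layout (arrival, initiation, termination, execution time) are hypothetical; only the two formulas come from the definitions above.

```python
def wait_times(arrival, initiation, termination, exec_time):
    """Return (w1, w2) for one task.

    w1 counts all waiting from arrival to termination; w2 excludes the
    time spent in the input buffer before initiation.
    """
    w1 = termination - arrival - exec_time
    w2 = termination - initiation - exec_time
    return w1, w2

def average_wait_times(records):
    """Average w1 and w2 over (arrival, initiation, termination, exec_time)
    records; the list must be non-empty."""
    pairs = [wait_times(*r) for r in records]
    n = len(pairs)
    return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)
```

For example, a task that arrives at time 0, is initiated at time 3, terminates at time 12 and needs 6 time units without buffers has a total wait of 6, of which 3 units were spent inside the pipeline.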

The throughput of a pipeline is proportional to segment
utilization when the buffers are assumed finite in size. A 100%
utilization of some segment corresponds to the maximum possible
throughput.

The second parameter $w_1$ above gives an indication of the
turnaround time of a task. A small turnaround time is a necessity in
many real-time applications.



The third parameter $w_2$ differs from the second parameter in
that the wait incurred in the input buffer is not included. A small
$w_2$ is preferred in some situations for the following reasons.

A large $w_2$ implies a large number of tasks inside the pipeline.
These tasks are partially processed and therefore if the normal
processing is to be suspended for any reason (e.g., due to an
interrupt) all the partially processed tasks either must be completed
or restored to their original state. Thus a large number of partially
processed tasks implies a slow response to, say, an interrupt. Hence a
small $w_2$ is desirable in some cases.




4.2.  Experimental  Investigation 

Analytical  results  for  internally  buffered  pipelines  with 
random  arrivals  are  very  difficult  to  obtain.  Markov  analysis  can  be 
applied.  However,  even  for  a small  pipeline  this  analysis  can  become 
quite  complex.  Markov  analysis  was  applied  to  a few  trivial  pipelines 
for  the  sole  purpose  of  verifying  the  validity  of  our  simulation 
techniques  by  comparing  the  results  of  these  two  methods. 

For simulation the pipelines were created with random task
flows. For a given number of segments and a given number of function
types, each type of task flow was determined by a random sequence of
integers, each integer taken between 1 and the number of segments with
equal probability. The length of the sequence (which is the same as
the execution time of the task) was also chosen arbitrarily, taking
care that it was not so long as to make the simulation too complex
computationally.
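
The construction just described can be sketched as follows. The name `random_pipeline`, the uniform choice of sequence length and the `max_length` cap are assumptions of this sketch; the text says only that the length was taken arbitrarily and kept reasonably short.

```python
import random

def random_pipeline(num_functions, num_segments, max_length, seed=0):
    """Create random task flows for a j.k pipeline (illustrative sketch).

    Each function type gets a sequence of segment numbers drawn uniformly
    from 1..num_segments. The sequence length, which equals the execution
    time of the task, is drawn uniformly from 1..max_length here.
    """
    rng = random.Random(seed)
    flows = {}
    for f in range(1, num_functions + 1):
        length = rng.randint(1, max_length)
        flows[f"F{f}"] = [rng.randint(1, num_segments) for _ in range(length)]
    return flows
```

A call such as `random_pipeline(4, 8, 18)` produces a pipeline of the kind designated 4.8 in the report's notation.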

The pipelines of Table 4.2.1, created in the above manner,
were used for simulation. The pipeline designation j.k stands for a
pipeline with j function types and k segments. The flow of each
function type is given by a sequence of integers, where each integer
stands for a segment number.

In each case the simulation was done for a fixed duration and
no attempt was made to continue the simulation until a steady state was
reached because in theory, for some cases a steady state may not exist,
and even if it did exist it could take an unreasonably long time to
reach it. However, for most practical purposes a steady state was
reached in many cases during the simulation period, especially at low
and medium workloads. The following is a discussion of the simulation
results.


4.3. Observations on the Experimental Results

We have presented in Table 4.3.1 some compacted results of
our simulation. Each pipeline of Table 4.2.1 was simulated for three
different workloads. In Table 4.3.1, the column marked W indicates the
workload, where HI, MED and LOW are respectively 100%, 75% and 50%
workload. The column marked BUF indicates the buffer size in each run.
U% is the maximum utilization of any segment in the pipeline and $w_1$ and $w_2$


Table 4.2.1. Pipelines used for simulation.

Pipeline   Flow of Tasks

1.4        F1: 4 1 4 1 3 1 4 2 2 2 1

4.4        F1: 4 3 4 4
           F2: 1 4 1 3 3 1 4 2 2
           F3: 2 2 1
           F4: 3 1 2 2 4 4 2 1

4.8        F1: 5 1 2 5 7 7 4 5 2 1 1 6 3 8 6 7 5 6
           F2: 8 2 6 5 1 6 6 5 6 6 1 4
           F3: 6 5 8 3 1 3 8 8 5 7
           F4: 3 6 8 3 7 2 5

4.16       F1: 8 2 6 8 2 7 13 7 3 4 6 14 2 11 6 13 9 2 5 13 5
           F2: 1 16 13 2 11 10 11 9 2 1
           F3: 2 8 6 3 13 16 12 4 4 7 3
           F4: 2 6 15 8 2 15 12 11 2 1 8 9

8.4        F1: 2 1 2 2 1 2 4 2 1 1 2
           F2: 1 2
           F3: 4 2 1 4 3 3 1 1 2 3 1 1 3 2 2 2 4 1 2 1 3 4 2 4 1 4 3 2 1
           F4: 1 1 3 1 1 2 4 3 4 3 4
           F5: 1 2 2 1 1 3 4 4
           F6: 2 1 4 1 2 1
           F7: 2 1 4 3 1 4 4 3 4 4 3 3 2 1
           F8: 3 2



are  the  wait  times  as  defined  earlier  in  Section  4.1.  Figures  4.3.1 
through  4.3.4  graphically  illustrate  the  effect  of  various  parameters 
on  performance  for  a "typical"  example.  The  pipeline  4.4  of  Table  4.2.1 
was  used  for  this  example.  The  data  of  these  figures  do  not  appear  in 
Table  4.3.1.  The  example  is  typical  in  the  sense  that  a similar  effect 
can  also  be  observed  for  most  examples  in  Table  4.3.1.  However,  it 
should  be  pointed  out  that  anomalies  always  exist  which  do  not  exhibit 
the  typical  behavior. 

The  observations  below  can  be  made  from  the  data  of  Table  4.3.1 
and Figures 4.3.1 through 4.3.4. As described before, $w_1$ refers to the
expected average wait time of a task from arrival to termination and
$w_2$ refers to the expected average wait time of a task from its
initiation in the pipeline to its termination.

1.  Segment  Utilization  (See  Table  4.3.1):  The  maximum  segment  utilization 
closely  follows  the  workload  in  most  cases  except  in  a few  cases 
when  the  workload  is  high  (~  99%)  and  buffer  size  is  small.  A 
possible  explanation  for  this  is  derived  from  the  modification  of 
the  priority  strategies.  Recall  that  a task  may  not  leave  a buffer 
if  its  successor  buffer  is  full  even  if  the  segment  is  available. 

A large  number  of  buffers  are  likely  to  be  full  if  their  size  is 
small  and  workload  is  high;  thus  forcing  some  segments  to  become 
idle  some  of  the  time.  In  this  type  of  situation  it  is  apparent 
that  an  increase  in  buffer  size  results  in  higher  utilization  of 
the  pipeline.  The  utilization  is  otherwise  somewhat  insensitive  to 
the  buffer  size  used.  It  is  also  insensitive  to  the  priority  used. 


Table 4.3.1. Simulation results.

[Table body not recoverable from the scanned source. Columns: PIPE, W,
BUF, and, for each of the priorities RANDOM*, LPF*, MWRF*, MPF* and
LWRF*, the values U%, $w_1$ and $w_2$.]



For  comparison  we  have  also  simulated  the  greedy  strategy. 

The  maximum  segment  utilization  achieved  in  each  case  is  listed  in 
Table  4.3.1.  Excepting  the  single  function  pipeline,  the  maximum 
utilization  achieved  in  all  examples  is  in  the  neighborhood  of 
55 to 60%. Since the throughput is proportional to segment
utilization we see that with greedy strategy, the throughput remains
considerably  below  the  theoretical  maximum.  In  contrast  to  this, 
internally  buffered  pipelines  achieved  a throughput  very  near  to 
the  theoretical  maximum  in  most  of  the  examples  that  we  tried. 

2.  The  effect  of  workload  on  wait  time  (see  Figures  4.3.1  and  4.3.3): 

It is quite apparent that irrespective of the buffer size and
priority, the wait times $w_1$ and $w_2$ are relatively insensitive to
workload in the low and medium load region and both these wait
times rise rapidly in the 90% to 100% workload region.

The  reason  for  this  is  not  hard  to  see.  At  high  workloads 
the  possibility  of  task  collisions  increases  because  at  any  time 
instant  a pipeline  is  likely  to  have  more  tasks  present  at  high 
workloads  than  at  low  workloads.  Furthermore,  when  the  number  of 
tasks  in  the  pipeline  is  high,  one  collision  can  trigger  multiple 
collisions,  thus  further  adding  to  the  wait  time. 

3. The effect of buffer size on wait time $w_1$ (see Figure 4.3.2): The
expected average wait time of a task is affected very little by a
change in buffer size beyond the size of 2 in the medium workload
(50% to 90%) region and is almost unaffected by buffer sizes in the
low workload (< 50%) region. In the very high workload region, with
very small buffers, an increase in buffer size would normally
result in a decrease in wait time $w_1$. This effect is similar to
the effect on segment utilization and occurs for the same reasons. It
is interesting to note a dip in the curves for 99.8 and 90% workloads
in Figure 4.3.2. The dip is not very significant but it is
still somewhat puzzling. We do not have a reasonable explanation
for that dip.

4. The effect of buffer size on wait time $w_2$ (see Figure 4.3.4):
From Table 4.3.1 and Figure 4.3.4 it is evident that $w_2$, the expected
average wait time of a task inside a pipeline, initially increases
with the increase in buffer size and then approaches a constant
value. This happens because when the buffers are very small, most
of the tasks accumulate in the input buffer and thus a large portion
of total wait time is incurred in the input buffers. Therefore
$w_2$ remains small. Any increase in the buffer size results in
reducing the number of tasks in the input buffers and increasing
the wait time $w_2$. Once the buffer size is sufficiently high,
buffers are rarely full and thus any increase in buffer size
has very negligible effect on the task flow. Hence the wait time
$w_2$ remains unaffected in this region.

5.  The  effect  of  priority  (see  Table  4.3.1):  In  most  cases  the 

smallest  wait  times  occur  with  MPF*  and  LWRF*  priorities;  slightly 
higher  wait  times  occur  with  RANDOM*  priority  and  considerably 
higher  wait  times  occur  with  LPF*  and  MWRF*  priorities.  The 




following  explanation  can  be  given  for  this  effect  when  the  buffers 
are  assumed  finite. 

In  MPF*  or  LWRF*  priorities,  high  priority  buffers  remain 
empty  most  of  the  time.  The  same  is  not  necessarily  true  in  LPF*  or 
MWRF*  priority.  Consider  what  happens  when  a low  priority  buffer  is 
full.  By  the  modification  imposed  on  priorities  with  finite  buffers, 
a task  cannot  leave  a buffer  if  its  successor  buffer  is  full.  Thus  a 
high  priority  buffer  may  become  full  because  of  its  successor  buffer 
being  full.  This  effect  may  be  propagated  to  the  highest  priority 
buffer.  Thus  even  the  highest  priority  buffer  may  not  be  empty  some 
of  the  time.  In  the  worst  case  it  may  happen  that  most  of  the  buffers 
remain  nearly  full,  thus  resulting  in  a very  high  wait  time.  This 
argument, of course, cannot be applied when the buffers are assumed
infinite or when the workload is low. When the workload is low, the
differences in wait times for various priorities are negligible. At
high workloads with infinite buffers, the data of Table 4.3.1 do
not conclusively indicate the superiority of one priority over others.


4.4.  Concluding  Remarks 

We observed that a workload very near to 100% can be executed
with internal buffers. In other words, a throughput very near to the
theoretical maximum can be attained. Moreover, the choice of priority
is not very critical as far as throughput is concerned. However, if
small wait times are preferred then the MPF* and LWRF* priorities were
found to be superior.

It  should  be  noted  here  that  ours  is  an  open  loop  analysis. 

In  a closed  loop  model  the  arrival  rate  of  tasks  will  depend  on  the 
processing  rate  and  the  wait  times.  Normally  this  dependency  will  have 
a stabilizing  effect  on  the  wait  time.  For  example  in  a computer  with 
a pipelined  arithmetic  unit,  an  increase  in  wait  time  in  the  arithmetic 
unit  results  in  a reduced  rate  of  arrivals  of  instructions,  which  in 
turn  results  in  a lower  wait  time.  The  throughput,  however,  will  also 
decrease.  We  did  not  attempt  a closed  loop  analysis  since  it  is 
highly  problem  dependent  and  since  we  are  considering  pipelines  in 
general. 

Although  we  shall  not  present  any  specific  data  it  should  be 
pointed  out  that  the  priority  can  be  "fine-tuned"  to  advantage.  We 
suggest  the  following  fine-tuning  procedure,  which  we  have  found  very 
effective in our limited experience.

Start  with  either  MPF*  or  LWRF*  priority  and  do  a simulation 
with  appropriate  arrival  rates.  Gather  the  individual  wait  time  and 
queue  size  statistics  of  each  buffer.  Increase  the  priority  of  those 
buffers  in  which  a disproportionately  large  amount  of  wait  time  is 
incurred  and  those  buffers  which  remain  full  most  of  the  time.  Decrease 
the  priority  of  those  buffers  which  remain  empty  most  of  the  time. 

After  the  priority  is  changed  conduct  the  simulation  again  and  readjust 
the priority in the above manner. Since this is somewhat of a
trial-and-error procedure and since the results are not always predictable,
several runs might be necessary. In a similar manner it is also
possible to come up with an "optimum size" of buffer for each step.
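
One readjustment pass of the fine-tuning procedure might look like the sketch below. Everything specific in it is an assumption made for illustration: the name `retune`, the statistics keys `wait_share`, `frac_full` and `frac_empty`, the three threshold values, and the choice to move a buffer by about one position per pass. The text does not specify how far a priority should be raised or lowered.

```python
def retune(priority, stats, wait_limit=0.3, full_limit=0.5, empty_limit=0.5):
    """Return a readjusted priority order (illustrative sketch).

    priority: list of buffer names, highest priority first
    stats:    buffer name -> dict with
              "wait_share": fraction of the total wait incurred here,
              "frac_full":  fraction of time the buffer was full,
              "frac_empty": fraction of time the buffer was empty
    Buffers with a disproportionate wait share, or full most of the time,
    move up; buffers empty most of the time move down.
    """
    score = {}
    for rank, buf in enumerate(priority):
        s = stats[buf]
        if s["wait_share"] > wait_limit or s["frac_full"] > full_limit:
            score[buf] = rank - 1.5   # overtakes one unchanged neighbor
        elif s["frac_empty"] > empty_limit:
            score[buf] = rank + 1.5   # drops behind one unchanged neighbor
        else:
            score[buf] = rank
    # sorted() is stable, so equal scores keep their previous order.
    return sorted(priority, key=score.get)
```

The returned order would then be fed back into the simulation, new statistics gathered, and the pass repeated until the wait times stop improving.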

We  also  simulated  the  greedy  strategy  in  this  chapter.  We 
observed  that  in  most  cases  a workload  of  up  to  55  to  60%  can  be 
executed.  Thus  the  maximum  throughput  achievable  with  greedy  strategy 
is  considerably  below  the  throughput  with  internal  buffers.  Even  though 
the  greedy  strategy  compares  poorly  with  internal  buffers  in  throughput, 
it  is  not  necessarily  less  cost-effective  than  internal  buffers.  We 
shall make some comments regarding cost-effectiveness in the next
chapter.




Chapter  5 

DESIGNING  OF  PIPELINES 


5.1.  Introduction 

In  this  chapter  we  introduce  some  concepts  for  designing  a 
pipeline.  We  shall  describe  various  options  a designer  can  exercise 
to  meet  his  goals  for  performance  and  cost.  The  number  of  available 
options  is  so  large  that  it  is  practically  impossible  to  formalize 
the  complete  design  procedure.  However,  some  formalization  does  exist 
at  a few  steps  in  the  design  procedure.  In  a multistep  procedure  where 
each  design  step  in  itself  is  a major  optimization  problem,  we  can  just 
hope  that  a design  conceived  by  locally  optimizing  at  each  step  is  close 
to  the  global  optimum.  A designer  must  make  judicious  decisions  at 
various  points  in  design  to  cut  down  on  the  number  of  possible  options 
and  make  the  problem  more  manageable.  We  propose  here  a top-down 
design  procedure.  At  every  point  in  the  design  we  try  to  achieve  a 
local  optimum.  To  bring  the  solution  closer  to  the  global  optimum  some 
backtracking must be done. How much effort should be spent on
backtracking depends on the criticality of the design requirements, for
example,  cost,  execution  delay,  throughput,  etc.  In  the  following,  we 
have  relied  heavily  on  illustrations  to  discuss  each  design  step. 




5.2.  A Design  Methodology 

We  start  out  with  the  given  function  or  task  which  is  to  be 

pipelined.  At  this  point  we  must  have  a specific  goal,  namely,  the 

throughput  which  must  be  met  or  exceeded,  the  cost  which  must  not  be 

exceeded  and  the  maximum  allowable  execution  delay  of  each  task. 

As a running example let the function to be pipelined be
$(a_0 + a_1 x + a_2 x^2)(b_0 + b_1 y)$, where x and y are variables and
$a_i$'s and $b_i$'s are constants.

The  first  step  in  the  design  procedure  is  the  formation  of 
a computation  graph,  variously  known  as  a precedence  graph  or  task 
graph.  Each  node  in  the  computation  graph  represents  an  operation  or 
subtask.  The  directed  edges  indicate  the  flow  of  information  as  well 
as  the  precedence  among  subtasks. 

A task  can  have  many  different  computation  graphs  for  two 
reasons.  The  first  is  that  there  are  many  different  algorithms  to 
compute  the  same  task.  The  second  is  that  there  are  many  ways  to 
define  subtasks  of  an  algorithm.  Choosing  a subtask  which  is  very 
small;  that  is,  the  operation  at  a node  in  the  graph  is  very  primitive, 
makes  the  graph  very  large. 

For our task $z = (a_0 + a_1 x + a_2 x^2)(b_0 + b_1 y)$, some of the ways
to compute z are:

1. $z = a_0 b_0 + (a_0 b_1) y + a_1 (b_0 x) + (a_1 b_1)(x y) + (a_2 b_0)(x \cdot x) + a_2 (b_1 y)(x \cdot x)$

2. $z = (a_0 + a_1 x + a_2 (x \cdot x))(b_0 + b_1 y)$

3. $z = (a_0 + [a_1 + a_2 x] \cdot x)(b_0 + b_1 y)$.
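
That the three ways of computing z agree can be checked numerically with the short sketch below. The function names are hypothetical, and the last term of the expanded form is taken to be $a_2 (b_1 y)(x \cdot x)$, which is what multiplying out the product requires.

```python
def z_expanded(x, y, a, b):
    """Form 1: fully expanded sum of products."""
    a0, a1, a2 = a
    b0, b1 = b
    return (a0 * b0 + (a0 * b1) * y + a1 * (b0 * x) + (a1 * b1) * (x * y)
            + (a2 * b0) * (x * x) + a2 * (b1 * y) * (x * x))

def z_factored(x, y, a, b):
    """Form 2: product of the two polynomials, with x*x formed explicitly."""
    a0, a1, a2 = a
    b0, b1 = b
    return (a0 + a1 * x + a2 * (x * x)) * (b0 + b1 * y)

def z_nested(x, y, a, b):
    """Form 3: nested evaluation of the quadratic factor."""
    a0, a1, a2 = a
    b0, b1 = b
    return (a0 + (a1 + a2 * x) * x) * (b0 + b1 * y)
```

The forms differ only in how many operations they need and in how much of the work can proceed in parallel, which is what makes the choice among them a design decision.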




For the same task some of the subtasks can be defined as
follows:

At a very low level: AND, OR, NOT operations.

At a moderately high level: Multiply and Add operations.

At a higher level: Operation which computes a polynomial of degree
one; e.g., $P_1(a_0, a_1, x) = a_0 + a_1 x$.

As an illustration, Figure 5.2.1 shows the computation graph
of expression $z = (a_0 + [a_1 + a_2 x] \cdot x)(b_0 + b_1 y)$ with subtask defined as
$P_1(a, b, c) = a + bc$. Figure 5.2.2 shows the computation graph for the
same expression but using smaller subtasks, namely, multiply and add.

The  selection  of  subtasks  can  be  done  in  the  following  way. 
Suppose  the  performance  goal  is  one  task  every  t seconds.  Then,  select 
a subtask which can be executed in approximately t/k seconds, where k
is an integer. The value of k determines the degree of maximum allowable
sharing of any segment; that is, if the given performance goal is to be met,
no more than k computation steps can be assigned to any one segment.

A large k implies a finer definition of subtasks. k should be
sufficiently  large  so  that  enough  flexibility  exists  later  in  the  design 
process  for  obtaining  desired  collision  characteristics.  However,  it 
should  not  be  so  large  that  the  resulting  complexity  of  the  reservation 
table  makes  the  problem  computationally  unmanageable.  Furthermore,  if 
the  design  is  to  use  components  already  available  in  the  market  then  k 
should  be  chosen  such  that  subtasks  conform  to  operations  available  in 
these  components.  Of  course  if  custom  made  components  are  to  be  used 
then  this  restriction  does  not  apply. 
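
The sharing limit implied by k can be checked mechanically, as in the following sketch. The names `max_sharing` and `overloaded_segments`, the step labels and the segment names "MUL" and "ADD" are hypothetical; times are kept as integers in a common unit to avoid rounding issues.

```python
from collections import Counter

def max_sharing(task_period, subtask_time):
    """Largest integer k with k * subtask_time <= task_period."""
    return task_period // subtask_time

def overloaded_segments(assignment, k):
    """Return the segments that have more than k computation steps.

    assignment: computation step name -> segment name
    """
    counts = Counter(assignment.values())
    return sorted(seg for seg, c in counts.items() if c > k)
```

For instance, if four multiplies are all assigned to one multiplier segment, the goal of one task every four multiplication times can be met, but a goal of one task every three multiplication times cannot unless a second multiplier is added.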


Figure  5.2.2.  A computation  graph  of  function  Z. 


Suppose in our example the performance goal is one task every
few multiplication times; then clearly the choice for subtasks is
multiply  and  add  operations.  Both  of  these  operations  can  also  be 
performed  by  available  circuit  packages. 

We have yet to say how we choose an optimum algorithm and
what optimum means here. Clearly, the definition of an optimum
depends on the goals of the design. If execution delay is more
critical than cost, we would choose an algorithm with the maximum
parallelism, so that we can do as much computation as possible
in parallel and thus reduce the execution time of each task. If
cost were of major importance, one would choose an algorithm with
the fewest total subtasks or primitive operations, where the
subtasks are simple and require little hardware.

It should be pointed out that selecting the proper subtasks
and the proper algorithm are not really two sequential steps, because
we must have some idea about the algorithm before we can define the
subtasks, and the decomposition into subtasks together with the
resultant segment sharing determines cost and performance.

For our example let us choose an algorithm with the fewest
subtasks, where subtasks are defined to be add and multiply operations.
Figure 5.2.2 is a computation graph of such an algorithm, which requires
3 adds and 4 multiplies.
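
As a check on the operation count, the evaluation order of the
expression can be sketched in a few lines of Python. The function name
`evaluate_z`, the counter dictionary and the tuple layout of the
coefficients are illustrative assumptions and not part of the original
design; only the nesting of adds and multiplies follows the expression
for z given earlier in this section.

```python
def evaluate_z(x, y, a, b):
    """Evaluate z = (a0 + [a1 + a2*x]*x) * (b0 + b1*y), counting subtasks.

    a = (a0, a1, a2) and b = (b0, b1) are hypothetical coefficient tuples.
    """
    counts = {"multiply": 0, "add": 0}

    def mul(u, v):
        counts["multiply"] += 1
        return u * v

    def add(u, v):
        counts["add"] += 1
        return u + v

    # Subfunction X: (a2*x + a1)*x + a0, two multiplies and two adds
    big_x = add(mul(add(mul(a[2], x), a[1]), x), a[0])
    # Subfunction Y: b1*y + b0, one multiply and one add
    big_y = add(mul(b[1], y), b[0])
    # Final step Z: the product of X and Y, one multiply
    return mul(big_x, big_y), counts


z, counts = evaluate_z(2, 3, (1, 2, 3), (4, 5))
print(z, counts)
```

Each call to `mul` or `add` corresponds to one node of the computation
graph, so the tally is the number of subtasks a straight through
pipeline would need.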

The next step in the design procedure is the choice of
functional units or segments to perform the subtasks of the computation
graph. The choice of the right technology and devices is best left to
the design engineer, who is probably the one best informed of the
existing state of the art in devices as well as the various tradeoffs
between speed, cost, reliability, power consumption, layout,
interconnections, etc. At this point it will suffice to say that the
designer must strike a balance between the speeds of the various
segments, because equalizing the utilization of all segments tends to
give better performance per unit cost.

In our example, multiply is the bottleneck operation for two
reasons. First, there are more multiply operations than add operations
in the computation graph. Second, multiply is a slower process than
add. Consider an add unit which adds two operands in 200 ns and a
multiply unit which multiplies two operands in, say, 4000 ns. To reduce
this large speed gap we may use a fast multiply unit which takes, say,
2400 ns. Since multiply still remains the bottleneck, we may choose a
slow but inexpensive adder which takes, say, 1200 ns. One should not
force a balance in speeds if no advantage results. For example, making
the 2400 ns multiplier twice as fast may be possible only with a 500%
increase in cost. Moreover, the resulting increased performance may be
far above the established goal. Similarly, making the 1200 ns adder
half as fast may not reduce the cost significantly.

Let us assume for our illustration that the add segment takes
1 time unit and the multiply segment takes 2 time units, where a time
unit is, say, 1200 ns. It is convenient to use such time units in the
reservation table. In general, we take a time unit equal to the
greatest common divisor (gcd) of the execution times of all segments.
Since every segment must have a latch or output register, let us
assume that the execution time of each segment also includes the latch
delay.

The time unit chosen in this manner determines the basic clock
rate of the pipeline. Since all execution times are expressed as
integral multiples of the time unit, and since all data transfers occur
synchronously at certain multiples of the basic time unit, we choose the
period of the clock equal to the basic time unit. A counter or a
frequency divider can then be used to generate control pulses at certain
multiples of the clock period. Clearly, the larger these multiples, the
more expensive the required frequency divider. These multiples can
be made smaller if the clock period can be made larger. However, since
the clock period is fixed by the gcd of the execution times of the
segments, the only way to increase the period is to increase the gcd.
This can be done by slightly adjusting the execution times of the
segments. For example, if the adder execution time is 1201 ns and the
multiply time is 2400 ns, then gcd(1201, 2400) is 1. This implies a
clock period equal to 1 ns. However, by taking the execution time of the
multiply segment equal to 2402 ns, a minor adjustment of 2 ns, we get
gcd(1201, 2402) = 1201, implying a clock period of 1201 ns.
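
The gcd arithmetic above is easy to verify mechanically. The following
Python sketch is only an illustration; the helper names `time_unit` and
`in_time_units` are assumptions introduced here. It computes the clock
period as the gcd of the segment execution times and expresses each
execution time as a multiple of that period.

```python
from functools import reduce
from math import gcd


def time_unit(execution_times_ns):
    """Clock period: the gcd of all segment execution times (in ns)."""
    return reduce(gcd, execution_times_ns)


def in_time_units(execution_times_ns):
    """Return the time unit and each execution time as a multiple of it."""
    unit = time_unit(execution_times_ns)
    return unit, [t // unit for t in execution_times_ns]


# Adder and multiplier times (ns) taken from the examples in the text
print(in_time_units([1200, 2400]))
print(in_time_units([1201, 2400]))
print(in_time_units([1201, 2402]))
```

With the unadjusted times of 1201 ns and 2400 ns the multiples are the
raw nanosecond counts, whereas the 2 ns adjustment brings them down to
1 and 2.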

One more reason for making the gcd as large as possible by
adjustment of segment times is the complexity of the reservation table.
Again take the example of adder time = 1201 ns and multiply time =
2400 ns, the gcd being 1. One add step will then be represented by
1201 X's and one multiply step by 2400 X's. However, if the add and
multiply times were taken as 1201 and 2402 ns respectively, then the add
step can be represented by only one X and the multiply step by 2 X's.
Thus making the gcd large helps in simplifying the reservation table.
Furthermore, it is unlikely that any scheduling of the pipeline could
exploit the 2 ns difference in multiply time to obtain higher
performance. Such minor adjustments in segment times will thus tend to
simplify the pipeline and its scheduling without sacrificing much, if
any, performance.

Once  the  basic  time  unit  is  determined,  we  are  in  a position 
to  start  assigning  various  subtasks  to  the  segments  and  enter  this 
information  in  a reservation  table.  We  start  out  by  forming  a straight 
through  pipeline  by  assigning  a distinct  segment  for  each  node  in  the 
computation  graph.  Then  we  allow  enough  sharing  of  the  segments  so 
that  the  lower  bound  average  latency  of  the  pipeline  is  no  greater  than 
the  target  average  latency.  To  illustrate  the  various  points  in  the 
design  process,  we  have  used  more  than  one  target  latency  in  the 
following. 

The straight through reservation table for task Z of the
tree type computation graph in Figure 5.2.2 is shown in Figure 5.2.3a.
M and A in the figure stand for multiply and add respectively. The
partial results obtained at the output of each segment are also shown
on the right side of the table. It is convenient to retain some
information on precedence among subtasks in the reservation table.
Therefore, let us use an X for subfunction $(a_0 + a_1 x + a_2 x^2)$ and
a Y for subfunction $(b_0 + b_1 y)$ and rewrite the reservation table as
in Figure 5.2.3b, with the understanding that X and Y are independent of
each other but both must precede Z.

[Figure 5.2.3, panels (a) and (b): reservation tables. Partial results
listed beside panel (a): $(a_2 x + a_1)$, $(a_2 x + a_1) \cdot x$,
$(a_2 x + a_1) \cdot x + a_0$, $b_1 y$, $(b_1 y + b_0)$,
$(b_1 y + b_0) \cdot [(a_2 x + a_1) \cdot x + a_0]$.]

Let us assume that the operands can be made available in
the pipeline wherever and whenever they are needed by appropriate
data routing. In our example the $a_i$'s and $b_i$'s are constants and
may be supplied internally exactly at those places where they are
needed. The variable y is used only once, starting at time instant 0.
However, the variable x is used twice in the pipeline, once during time
unit 0 and once during time unit 3. The operand x must be taken through
delay segments so that it is available again at time instant 3. We have
not shown any delay elements in the reservation table to keep it more
readable. Delay elements are also omitted in all the reservation tables
to follow. Once the reservation table is finalized and made allowable
to an appropriate cycle, we can assign the elemental delays to
noncompute segments as was discussed in detail in Sections 2.4 and 2.9.

Suppose the performance goal is 1/8 tasks per unit time or,
equivalently, an average initiation latency of 8. Suppose it is required
that the constant latency cycle (8) be allowable. The lower bound of
Figure 5.2.3b is 2. If the lower bound need be no greater than 8, then
we can share up to 8 usages of any segment. In Figure 5.2.4a all
multiply operations are performed by segment $S_0$ and all add
operations by $S_1$. The lower bound latency of Figure 5.2.4a is 8. This
table can be made allowable to cycle (8) by inserting noncompute
segments. Algorithm 2.7.1 can be used to accomplish this, but the
constraints must be properly formed. In particular, the constraints
should reflect the independence between X and Y and the dependence of Z
on X and Y. Moreover, the indivisibility of the multiply step should
also be taken into account. A minimum delay solution is given in
Figure 5.2.4b. This solution has only one multiplier and one adder.
However, this is not necessarily the minimum cost solution. As pointed
out in Chapter 2, the computation of the actual cost is difficult,
primarily due to lack of information on the data routing scheme. In
this example, the cost factors consist of multipliers, adders, latches
and multiplexers required for data routing and interconnections, as
well as control logic. A designer must determine the cost of control,
multiplexing, and interconnection from the data routing and scheduling
schemes he intends to implement.

[Figure 5.2.4, panels (a) and (b): reservation tables with rows
$S_0$ (M) and $S_1$ (A).]

Figure 5.2.4. Pipeline with lower bound 8 made allowable with
respect to cycle (8).
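
The lower bounds quoted above (2 for the straight through table and 8
when one multiplier and one adder are shared) can be reproduced with a
short Python sketch. It assumes, consistently with those figures, that
the lower bound is the largest number of time units for which any single
segment is reserved by one task. The function names and the segment
names `M0`..`M3`, `A0`..`A2` for the straight through pipeline are
hypothetical labels introduced only for this illustration.

```python
MULTIPLY, ADD = 2, 1  # step durations in time units, as assumed in the text


def lower_bound(assignment):
    """Largest total reservation time of any one segment per task.

    `assignment` maps a segment name to the durations of the
    computation steps assigned to that segment.
    """
    return max(sum(steps) for steps in assignment.values())


def meets_goal(assignment, target_latency):
    """True if the lower bound does not exceed the target average latency."""
    return lower_bound(assignment) <= target_latency


# Straight through: one distinct segment per node of the computation graph
straight_through = {f"M{i}": [MULTIPLY] for i in range(4)}
straight_through.update({f"A{i}": [ADD] for i in range(3)})

# Full sharing: all four multiplies on S0, all three adds on S1
shared = {"S0": [MULTIPLY] * 4, "S1": [ADD] * 3}

print(lower_bound(straight_through), lower_bound(shared))
print(meets_goal(shared, 8), meets_goal(shared, 4))
```

The check says nothing about allowability; it only confirms that a
given degree of sharing does not rule out the target latency.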

Next, suppose the performance goal is one task every 4 time
units. Let us also suppose that execution delay is very critical and
that any cycle is permissible if the delay can be kept small. We must
bring down the lower bound of Figure 5.2.4a by duplicating or subdividing
the multiply segment or by using a faster multiply segment. Suppose we
duplicate the multiply segment. There are three distinct ways in which
four multiply operations can be assigned two each to two identical
segments $S_0$ and $S_2$. These arrangements are shown in Figure 5.2.5
(based on Figure 5.2.3b). The X's and Y's are placed as soon as possible
in these figures and are not placed for any particular collision
characteristics. To obtain a minimum delay solution we must solve for
all three tables and over all cycles with average latency 4, which is
truly an impossibility! Therefore let us apply some heuristics.


[Figure 5.2.5, panels (a), (b) and (c): reservation tables over time
units 0 to 7 with rows $S_0$ (M), $S_1$ (A) and $S_2$ (M). Multiply
steps on $S_0$ / $S_2$: (a) X, Z / Y, X; (b) X, X / Y, Z;
(c) X, Y / X, Z.]

Figure 5.2.5. Three distinct assignments of four multiply
operations to segments $S_0$ and $S_2$.


We  commented  in  Section  2.11  that  cycles  with  large  periods  are  less 
likely  to  produce  small  delay  solutions.  So  let  us  restrict  our  search 
for  an  optimum  solution  to  the  cycles  with  period  less  than  or  equal  to 
8,  and  of  course  with  average  latency  equal  to  4.  All  such  distinct 
cycles  are: 

(4),  (1,7),  (2,6),  (3,5). 
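
The list of candidate cycles can be generated mechanically. The Python
sketch below is an illustration under stated assumptions: latencies are
positive integers, two cycles that are rotations of each other are
treated as the same cycle, and a cycle that merely repeats a shorter
cycle (such as (4,4)) is not counted separately. All function names are
hypothetical.

```python
def compositions(total, parts):
    """Yield all tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def canonical(cycle):
    """Smallest rotation, used to identify cycles that differ by rotation."""
    return min(cycle[i:] + cycle[:i] for i in range(len(cycle)))


def is_primitive(cycle):
    """False if the cycle is a repetition of a shorter cycle."""
    n = len(cycle)
    return not any(
        n % d == 0 and cycle == cycle[:d] * (n // d) for d in range(1, n)
    )


def cycles_with_average_latency(avg, max_period):
    """Distinct cycles with the given average latency and bounded period."""
    found = set()
    for n in range(1, max_period // avg + 1):  # n initiations, period avg*n
        for c in compositions(avg * n, n):
            if is_primitive(c):
                found.add(canonical(c))
    return sorted(found, key=lambda c: (len(c), c))


print(cycles_with_average_latency(4, 8))
```

For an average latency of 4 and a period of at most 8 this enumeration
yields the same four cycles as listed above.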

By using the techniques of Section 2.9 we determine the largest
allowable rows for these cycles. This can be done by forming the
largest compatibility classes of these cycles. Such classes containing
element 0 are, for each cycle:

cycle (4):    class {0,1,2,3}

cycle (1,7):  class {0,2,4,6}

cycle (2,6):  classes {0,1,4,5}, {0,3,4,7}

cycle (3,5):  class {0,2,4,6}.
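
The classes listed above can be reproduced by brute force. The Python
sketch below is not the procedure of Section 2.9; it assumes that two
columns of a row are compatible when their difference modulo the period
p does not equal the difference between any two initiation times of the
cycle, and it simply searches for the largest sets of pairwise
compatible columns that contain 0. The function names are hypothetical.

```python
from itertools import accumulate, combinations


def forbidden_differences(cycle):
    """Return the period p and the nonzero initiation-time differences mod p."""
    p = sum(cycle)
    starts = [0] + list(accumulate(cycle))[:-1]  # initiation times in one period
    return p, {(a - b) % p for a in starts for b in starts if a != b}


def largest_classes_with_zero(cycle):
    """Largest sets of pairwise compatible columns (mod p) that contain 0."""
    p, forbidden = forbidden_differences(cycle)
    for size in range(p - 1, -1, -1):  # try the biggest classes first
        found = []
        for rest in combinations(range(1, p), size):
            cls = (0,) + rest
            if all((b - a) % p not in forbidden
                   for a, b in combinations(cls, 2)):
                found.append(set(cls))
        if found:
            return found
    return [{0}]


def has_adjacent_columns(cls, p):
    """True if the class contains two consecutive columns modulo p."""
    return any((c + 1) % p in cls for c in cls)


for cycle in [(4,), (1, 7), (2, 6), (3, 5)]:
    print(cycle, largest_classes_with_zero(cycle))
```

The extra helper `has_adjacent_columns` tests whether a class offers two
consecutive columns, which is the property an indivisible two-unit
multiply step needs in order to fit in a row.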

The largest rows allowed by cycles (1,7) and (3,5) do not
have any two consecutive elements. Thus a multiply operation, which
uses two consecutive time units, cannot be accommodated in the largest
allowable row of these cycles. In other words, since multiply is
assumed to be indivisible, the pipelines of Figure 5.2.5 cannot be
made allowable with respect to cycles (1,7) and (3,5). The pipelines
can be made allowable with respect to the rest of the cycles, namely,
(4) and (2,6). A minimum delay solution for each pipeline of
Figure 5.2.5 is given in Figures 5.2.6 and 5.2.7 for cycle (4) and
cycle (2,6), respectively. The minimum delay solution among all these
pipelines is that of Figure 5.2.6a, in which the added execution delay
is 0. Even though the added execution delay is zero, elemental
delays must be inserted at various places for the proper flow of the
operands. The solutions of Figures 5.2.7b and 5.2.7c are also
attractive. For example, data routing might be simpler for the pipeline
of Figure 5.2.7c, because segment $S_0$ has only one source, namely, the
input part of the pipeline, and only one sink, namely, segment $S_1$.

[Figure 5.2.6: reservation tables with rows $S_0$ (M), $S_1$ (A) and
$S_2$ (M).]

Figure 5.2.6. Pipelines of Figure 5.2.5 made allowable with
respect to cycle (4).

[Figure 5.2.7, panels (a), (b) and (c): reservation tables with rows
$S_0$ (M), $S_1$ (A) and $S_2$ (M).]

Figure 5.2.7. Pipelines of Figure 5.2.5 made allowable with
respect to cycle (2,6).

If the target were one task every two time units, then we would
have to replicate some segments to bring the lower bound to 2. A minimum
delay pipeline allowed by cycle (2) is given in Figure 5.2.8. The added
execution delay in this solution is zero.

To achieve a throughput of one task every time unit we need
to divide the multiply segment so that the two time units of the multiply
steps can be separated and assigned individually to two different
segments. Ideally the division of a segment should divide the segment
into equal parts, that is, the delay in each part is the same. Such a
division is not always possible or practical for the following reasons.

1.  If the original segment utilizes some internal feedback, for example,
when the algorithm used by the segment is iterative, such a physical
partition of the segment cannot always be made. In particular, if a
feedback path lies across the intended dividing boundary then no
such division can be made.

2.  After a segment is divided each part requires a latch or output
register for synchronization purposes. Since latches have to be
inserted at division boundaries, a segment cannot be divided if it
is available only as a single integrated circuit package.

3.  Additional latches required after the division of a segment add
delay to computation steps. This additional delay no longer permits
the use of the original pipeline clock.

Some  of  the  ways  to  circumvent  the  above  problems  are  as  follows. 

1.  Use a different algorithm for the segment so that internal feedback
paths do not exist across a dividing boundary.

2.  Slow down the pipeline clock to include the additional latch delays.
This of course reduces the throughput from what it would have been
if the clock had not been altered.

3.  For division, use a faster segment to compensate for the additional
latch delays and thus maintain the original clock timing.

4.  Adjust the division boundary such that it does not lie across a
feedback path or pass through an integrated circuit package. This
of course may make the division of the segment uneven and as a result
require some adjustment in clocking, thus degrading the performance
from what it would have been with ideal division.

5.  One more way to overcome the problem of feedback and integrated
packages is to partially emulate the m-way division of a segment by
multiplexing m copies of the segment. For example, division of
a segment into two halves can be emulated by two copies of the original
segment, a demultiplexer and a multiplexer. The demultiplexer
switches the incoming tasks alternately between the two segments
and the multiplexer remerges their outputs. The clock will have to
be adjusted due to the additional delay of
multiplexing/demultiplexing. Such an arrangement emulates the
successive usages of the parts of a divided segment.

In our example let us suppose that we have succeeded in
dividing the multiply segments into two halves. The pipeline resulting
from Figure 5.2.3b with lower bound one is given in Figure 5.2.9,
in which each original multiply segment M is replaced by its first
and second halves.

Any further increase in throughput requires further segment
division. However, this is not an appropriate procedure to achieve
more throughput. Remember that the choice of algorithm and definition
of subtasks were not meant for a broad range of throughput. If the
performance goal had been very high then we should have started with a
computation graph at a lower level by defining the subtasks on a finer
scale.

So far we have considered only single function pipelines.
Moreover, we restricted ourselves to using only fixed cycles for
initiations. A single function pipeline can tolerate changes in the
arrival pattern by the simple addition of an input buffer. The buffer
can absorb the fluctuations in the arrivals and convert them to the
desired initiation cycle. For these reasons, in the design of single
function pipelines there is no need for greedy control or internal
buffers. Both these alternatives are likely to be [illegible]
than use of noncompute segments. However, in multifunction
pipelines noncompute segments are not always [illegible]
reasons, such as the difficulty in finding a good initiation cycle and
the inability to process random arrivals. Therefore we shall need some
alternatives to noncompute segments for the design of multifunction
pipelines.

[Figure 5.2.9: reservation table of the pipeline with divided multiply
segments; surviving caption text: "... in a lower bound of 1."]

Multifunction pipelines: The same considerations apply in
the formation of a computation graph as those for a single function
pipeline. Next we form the reservation tables and use an appropriate
control mechanism as follows.

If  the  arrivals  are  well  defined  so  that  an  initiation  cycle 
can  be  formed  then  we  can  use  either  noncompute  segments  or  internal 
buffers.  In  either  case  the  reservation  table  can  be  formed  so  as  to 
meet  or  slightly  exceed  the  required  throughput.  If  the  reservation 
table  cannot  be  made  allowable  with  the  use  of  noncompute  segments  or 
if  it  can  be  made  allowable  only  with  a very  high  delay  or  cost  then 
we  should  use  internal  buffers  in  the  manner  of  Chapter  3.  A simulation 
should  be  carried  out  to  determine  a good  priority  scheme  and  the 
actual  buffer  sizes  required  in  the  implementation. 

If the arrivals are nondeterministic then we use either greedy
control or internal buffers, depending on the cost-effectiveness. If
the cost of a single buffer is comparable to the cost of a segment, as
is the case when the segment is made of only a few logic gates, then
greedy control is likely to be far more cost-effective than
internal buffers, as explained below.

If greedy control is chosen then the reservation table should
be formed so that the potential throughput is considerably higher than
the specified goal. Some of our simulations on a few specific
examples have shown that a workload of 50% to 60% is processable with
greedy control. Thus as an initial step the reservation table may be
formed such that its potential throughput is about twice the required
throughput. A simulation should then be carried out on this
reservation table and the throughput adjusted (by segment replication
or division and sharing) accordingly until a satisfactory design is
obtained. The description of the actual control mechanism for a greedy
strategy can be found in [THO76].

If  the  costs  of  a segment  and  a buffer  are  comparable,  then 
the  cost  of  duplicating  a segment  is  also  comparable  to  that  of  adding 
a buffer.  Since  in  our  implementation  of  internal  buffers,  we  must  have 
at  least  one  buffer  per  computation  step,  the  cost  of  internal  buffers 
may  far  exceed  the  cost  of  duplicating  or  even  triplicating  all 
segments.  Thus  even  if  greedy  control  requires  the  pipeline  to  have 
potentially  twice  or  thrice  the  required  throughput,  it  can  be  more 
cost-effective  than  the  use  of  internal  buffers. 

When  the  cost  of  a single  buffer  is  relatively  low  compared 
to  that  of  a segment  we  may  choose  the  internal  buffers  over  greedy 
control.  Here  also,  we  must  form  the  reservation  table  with  a potentially 
higher  throughput  than  the  specified  goal,  since  this  choice  reduces 
the  average  wait  time  of  a task  and  also  reduces  the  required  buffer 
sizes  as  was  indicated  in  Chapter  4.  As  pointed  out  in  Section  4.4, 
it  is  advisable  to  carry  out  a simulation  to  determine  a good  priority 
and  appropriate  buffer  sizes. 



Designing  pipelines  with  more  capacity  than  required  is  not 
exclusive  to  multifunction  pipelines  with  greedy  control  or  internal 
buffers.  Sometimes  in  the  case  of  single  function  pipelines  it  helps 
to  design  a pipeline  with  more  capability  than  required,  that  is,  a 
pipeline  with  lower  bound  less  than  the  required  average  latency. 

One  reason  for  this  is  that  segments  may  not  be  fully  utilized  and  hence 
there  is  more  freedom  to  move  the  usages  (X's)  of  a segment  so  that  the 
pipeline  becomes  allowable.  In  many  instances,  this  increase  in  freedom 
results  in  a small  delay  solution.  Another  reason  for  designing  a 
single  function  pipeline  with  higher  capability  than  required  is  so 
that  the  cost  of  data  routing  might  be  low.  This  happens  because  the 
complexity  of  data  routing  increases  faster  than  linearly  with  the 
increase  in  segment  usages.  Thus  beyond  a certain  point  the  savings 
in  the  cost  of  segments  might  be  more  than  offset  by  the  increase  in 
the  cost  of  data  routing.  The  quantification  of  the  foregoing 
observations  is  very  difficult  in  the  general  case  and  we  shall  not 
attempt  it  here.  The  following  is  a summary  of  the  design  methodology 
presented  in  this  section. 

5.3.  Summary 

1.  Choose  an  algorithm  (or  algorithms  in  case  of  multifunction 
pipelines)  for  the  task  to  be  pipelined. 

2.  Define the subtasks based on the performance requirements, technology
and availability of modules.

3.  Form the computation graph of the algorithm using the subtasks
defined in step 2.

4.  Iterate steps 1 through 3 to obtain a computation graph with
desired properties (e.g., shortest execution time or fewest number
of subtasks or distinct subtasks, etc.).

5.  Form  a preliminary  reservation  table  from  the  computation  graph. 

6a.  [For all single function pipelines]. Modify the reservation table
by segment replication or division so as to meet or slightly exceed
the required throughput rate, that is, obtain a lower bound latency
no greater than the required average latency. Form initiation cycles
with the required average latency. Restrict the cycles to small
periods. Form allowability and precedence constraints as described in
Chapter 2. Use Algorithm 2.7.1 to obtain all possible minimum
delay solutions. Assign the elemental delays to noncompute
segments as in Section 2.4 or 2.9. Choose a solution having the
desired characteristics (e.g., minimum cost, minimum delay, etc.).

6b.  [For multifunction pipelines with a well-defined arrival pattern].
Follow exactly the same procedure as in 6a. If the reservation
table cannot be made allowable or if it can be made allowable only
with very high cost and delay then use internal buffers. See
Chapter 3 for specific details on internal buffers with periodic
arrivals.

6c.  [For multifunction pipelines with nondeterministic arrivals].
Modify the reservation table with segment replication or division
such that the maximum possible throughput is considerably higher
than the required throughput. Use greedy control if the cost of
a segment is comparable to that of a buffer. Otherwise use internal
buffers of appropriate size and priority using Chapter 4 as a
guide. In either case simulations should be carried out before
arriving at a final reservation table.

Even though the above procedure is given as a top-down
design, it is advisable to backtrack at several places for a better
solution. Moreover, the cost tradeoff must be considered to compare
segment replication and/or finer segmentation with the use of
noncompute segments or internal buffers.



Chapter  6 

CONCLUDING  REMARKS 

6.1.  Conclusions  and  Summary  of  Results 

We have shown that the theoretical maximum throughput of a
pipeline can be achieved with the use of noncompute segments or internal
buffers. We have allowed a considerable degree of freedom in the
choice of an initiation cycle for achieving the maximum throughput.
We have also shown that even with nondeterministic arrivals the use of
internal buffers results in a near maximum throughput. The specific
results can be summarized as follows.

The  allowability  characteristics  of  cycles  and  pipelines 
have  been  presented  in  considerable  detail.  It  was  established  that 
modulo  p equivalents  of  various  quantities  were  sufficient  when  dealing 
with  allowability,  where  p is  the  period  of  an  initiation  cycle.  This 
resulted  in  a substantial  reduction  in  the  complexity  of  the  theory. 

The  concept  of  compatibility  classes  of  a cycle  was  found  to  be  very 
useful.  From  the  compatibility  classes  one  can  determine  the  structure 
of  all  allowable  pipelines  of  a given  cycle. 

We showed that under certain conditions a pipeline can be
made allowable with respect to some cycle with insertion of delays.
For a single function pipeline, it is always possible to add delays to
it and make it allowable with respect to a constant latency cycle having
a latency equal to or greater than the lower bound of the pipeline. We
have developed systematic and effective procedures to insert delays in
a pipeline to make it allowable for a given cycle.




We also investigated the effect of internal buffers in
pipelines. Several different priority schemes were considered for
resolving collisions. For priority, LIFO-global and any
Precedence-Implied Priority it was shown that any cycle was executable
with bounded queues and bounded wait time, if the cycle implied a
workload no greater than 100%. Thus a great degree of freedom in the
choice of a cycle was achieved with internal buffers. We also presented
the upper bounds on the queue size and wait times for the above
mentioned priorities. These bounds were found to be proportional to the
number of initiations in a cycle.

When  the  arrivals  were  not  assumed  periodic  we  found  by 
simulation  that  internal  buffers  were  still  very  effective.  A near 
maximum  throughput  was  achieved  in  nearly  all  experiments  using 
randomly  structured  pipelines.  We  also  experimentally  studied  the 
dependence  of  wait  time  on  buffer  size,  workload  and  priority. 

Finally  we  presented  a methodology  to  design  pipelines  with 
noncompute  segments  or  internal  buffers. 

The  results  of  this  research  are  applicable  to  any  pipeline 
which  conforms  to  our  assumed  model.  Many  computations  can  be  pipelined 
using  our  model.  Notably,  the  arithmetic  units  of  computers  can  be 
effectively  designed  and  controlled  using  the  techniques  presented  in 
this work. Many special purpose computations, such as
Fast-Fourier-Transform, can also be effectively pipelined, while some
existing pipelines can be modified to increase their throughput.




Finally,  the  pipelines  modeled  here  are  generally  applicable 
beyond  computational  systems,  for  example  to  job  shop  and  flow  shop 
design  and  scheduling. 




6.2.  Suggestions  for  Further  Research 

Although we did cover a considerable number of properties
of cycles, there is still need to refine some of the theory. As
yet a simple method to test the perfectness of a cycle does not exist.
Perfect cycles are useful since they allow 100% utilization of pipelines.

During the design of a pipeline in Chapter 5, we took a few
cycles with small periods, made the pipeline allowable for each of
these cycles, and then chose the best solution. We had to try several
cycles because we have no criterion for determining whether a
particular cycle will have a good solution. Even the advice to
use cycles with small periods is heuristic at best. Thus, work remains
to be done in finding an efficient algorithm to construct a "good"
cycle for a given pipeline, so that making the pipeline allowable for
that cycle requires the minimum possible added delay.

In the case of internally buffered pipelines with periodic
initiations, the FIFO-global, MPF and LWRF priorities remain
unsolved. Two conjectures for Precedence-Implied Priority also remain
to be proven.

Some  work  needs  to  be  done  at  the  implementation  level  of  the 
pipeline. Specifically, a systematic procedure is needed to establish
a cost-effective data-routing scheme in the pipeline. Such a
procedure  is  necessary  to  form  a complete  cost-model  of  the  pipeline, 
since  a considerable  portion  of  the  total  cost  of  a complex  pipeline 
is  the  cost  of  data-routing. 

Almost  every  step  of  the  design  procedure  of  Chapter  5 could 
be  made  more  rigorous  and  less  heuristic.  Each  step,  however,  is  a 
major  research  problem  in  itself. 



